• Implicit String-Literal Concatenation

    From Lawrence D'Oliveiro@21:1/5 to All on Sat Feb 24 23:05:47 2024
    I like using this for long strings:

    fputs
    (
    "When an uncleft or a bulkbit wins one or more bernstonebits above\n"
    "its own, it takes on a backward lading. When it loses one or\n"
    "more, it takes on a forward lading. Such a mote is called a\n"
    "*farer*, for that the drag between unlike ladings flits it. When\n"
    "bernstonebits flit by themselves, it may be as a bolt of\n"
    "lightning, a spark off some faststanding chunk, or the everyday\n"
    "flow of bernstoneness through wires.\n",
    stdout
    );

    Of languages that derive ideas from C, only C++ and Python seem to have
    kept this. Java, JavaScript and PHP have not, for some reason.
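
    A minimal sketch of what the compiler does with adjacent literals like
    the ones above, assuming a C11 compiler. The names pieces, whole and
    same_text are made up for illustration and are not from the post.

```c
#include <assert.h>   /* static_assert (C11) */
#include <string.h>

/* Two adjacent string literals: the compiler merges them into one
   array during translation phase 6. */
static const char pieces[] =
    "When it loses one or\n"
    "more, it takes on a forward lading.\n";

/* The same text written as one long literal. */
static const char whole[] =
    "When it loses one or\nmore, it takes on a forward lading.\n";

/* Checked at compile time: both arrays have the same size. */
static_assert(sizeof pieces == sizeof whole,
              "adjacent literals form a single array");

/* Returns 1 when both spellings hold the same characters. */
int same_text(void)
{
    return strcmp(pieces, whole) == 0;
}
```

    No operator and no run-time work is involved; the merge happens before
    the program is compiled to code.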

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Lawrence D'Oliveiro on Sun Feb 25 17:38:38 2024
    On 25.02.2024 00:05, Lawrence D'Oliveiro wrote:
    I like using this for long strings:

    fputs
    (
    "When an uncleft or a bulkbit wins one or more bernstonebits above\n"
    "its own, it takes on a backward lading. When it loses one or\n"
    "more, it takes on a forward lading. Such a mote is called a\n"
    "*farer*, for that the drag between unlike ladings flits it. When\n"
    "bernstonebits flit by themselves, it may be as a bolt of\n"
    "lightning, a spark off some faststanding chunk, or the everyday\n"
    "flow of bernstoneness through wires.\n",
    stdout
    );

    I also liked to be able to use this feature _in some cases_ in C++.

    Not in the given case, though, where I like to more clearly see the
    newlines, so I'd prefer cout << "..." << endl


    Of languages that derive ideas from C, only C++ and Python seem to have
    kept this. Java, JavaScript and PHP have not, for some reason.

    In Java you have at least the string concatenation operator + which
    is, IMO, pretty good for that line structuring across source lines.

    In Awk (another "C-like" language), string concatenation has no visible
    operator, so we can for example write print "Hell" "o " "world"
    But since Awk defines lines much more restrictively, you cannot spread
    these strings across several lines without a line-continuation escape.
    (It's not too bad to add a terminating '\' where desired.)

    As for your "for some reason", I could only speculate (and so I
    abstain).

    Janis

  • From Blue-Maned_Hawk@21:1/5 to All on Sun Feb 25 16:45:20 2024
    I've used this to make strings with embedded newlines look in the source
    file closer to how they'd look on output.


    --
    Blue-Maned_Hawk│shortens to Hawk│/ blu.mɛin.dʰak/ │he/him/his/himself/Mr. blue-maned_hawk.srht.site
    2017 called, but i couldn't understand what they were saying over all the screams.

  • From Lawrence D'Oliveiro@21:1/5 to Janis Papanagnou on Sun Feb 25 20:43:31 2024
    On Sun, 25 Feb 2024 17:38:38 +0100, Janis Papanagnou wrote:

    In Java you have at least the string concatenation operator + which is,
    IMO, pretty good for that line structuring across source lines.

    Implicit concatenation works well in Python because you also have the
    “%” operator overloaded to perform printf-style formatting with a
    string. If you had to use “+” then, because that binds less tightly
    than “%”, you would have to have parentheses as well, which are
    unnecessary with implicit concatenation. E.g.

    # depreciation entries
    sql.cursor.execute \
    (
    "insert into payments set when_made = %(when_made)s,"
    " description = %(description)s, other_party_name = \"\","
    " amount = %(amount)d, kind = \"D\", tax_year = %(tax_year)d"
    %
    {
    "when_made" : end_for_tax_year(tax_year) - 1,
    "description" :
    sql_string
    (
    "%s: %s $%s at %d%% from %s"
    %
    (
    entry["description"],
    entry["method"],
    format_amount(entry["initial_value"]),
    entry["rate"],
    format_date(entry["when_purchased"]),
    )
    ),
    "amount" : - entry["amount"],
    "tax_year" : tax_year,
    }
    )

    Or, for added fun, how about parameterizing a format:

    num_format = "%%.%dg" % nr_digits
    ...
    for axis in range(3) :
    out.write \
    (
    " (%s, %s),\n"
    %
    (num_format, num_format)
    %
    (
    min(v.co[axis] for v in the_mesh.vertices),
    max(v.co[axis] for v in the_mesh.vertices)
    )
    )
    #end for

  • From Lawrence D'Oliveiro@21:1/5 to Lawrence D'Oliveiro on Sun Feb 25 20:25:09 2024
    On Sat, 24 Feb 2024 23:05:47 -0000 (UTC), Lawrence D'Oliveiro wrote:

    Of languages that derive ideas from C, only C++ and Python seem to have
    kept this. Java, JavaScript and PHP have not, for some reason.

    I’d forgotten to check Perl; it doesn’t have implicit concatenation
    either.

  • From bart@21:1/5 to Lawrence D'Oliveiro on Sun Feb 25 21:20:13 2024
    On 25/02/2024 20:43, Lawrence D'Oliveiro wrote:
    On Sun, 25 Feb 2024 17:38:38 +0100, Janis Papanagnou wrote:

    In Java you have at least the string concatenation operator + which is,
    IMO, pretty good for that line structuring across source lines.

    Implicit concatenation works well in Python because you also have the
    “%” operator overloaded to perform printf-style formatting with a
    string. If you had to use “+” then, because that binds less tightly
    than “%”,

    You mean it binds less tightly than implicit concatenation? So that:

    "abc" % "def" "ghi" means "abc" % ("def" "ghi")
    "abc" % "def" + "ghi" means ("abc" % "def") "ghi"


    you would have to have parentheses as well, which are
    unnecessary with implicit concatenation. E.g.

    # depreciation entries
    sql.cursor.execute \
    (
    "insert into payments set when_made = %(when_made)s,"
    " description = %(description)s, other_party_name = \"\","
    " amount = %(amount)d, kind = \"D\", tax_year = %(tax_year)d"
    %
    {
    "when_made" : end_for_tax_year(tax_year) - 1,
    "description" :
    sql_string
    (
    "%s: %s $%s at %d%% from %s"
    %
    (
    entry["description"],
    entry["method"],
    format_amount(entry["initial_value"]),
    entry["rate"],
    format_date(entry["when_purchased"]),
    )
    ),
    "amount" : - entry["amount"],
    "tax_year" : tax_year,
    }
    )

    Or, for added fun, how about parameterizing a format:

    num_format = "%%.%dg" % nr_digits
    ...
    for axis in range(3) :
    out.write \
    (
    " (%s, %s),\n"
    %
    (num_format, num_format)
    %
    (
    min(v.co[axis] for v in the_mesh.vertices),
    max(v.co[axis] for v in the_mesh.vertices)
    )
    )
    #end for

    Although I can't see that it made much difference here. Is this an
    example of how bad it can be without implicit concatenation, or is it
    this complicated despite that?

    I ask because I can't see any "+" operators between strings, and what
    follows "%" is usually something starting with "(" or "{", not a string
    constant.

  • From Lawrence D'Oliveiro@21:1/5 to All on Mon Feb 26 20:31:18 2024
    On Mon, 26 Feb 2024 21:12:39 +0100, Łukasz 'Maly' Ostrowski wrote:

    Java (Text Blocks):
    String s = """
    multi line string""";

    Python has those, too. I use them sometimes. Generally I’m not fond of
    them, because I think they’re wrongly defined.

    JavaScript (Template Literal):
    let s = `multi line string`;

    I think Python has something like that now, too. F-strings?

    Still more convenient than C.

    I still like having the choice of implicit concatenation, because then I
    fully control what appears in the string.

    Tip: I have Emacs macros defined to strip and add the quoting/escaping,
    because I find the strings are easier to edit without that.

    PHP? Don't care about PHP, it's shit, not even checking, most likely
    some kind of a Perl-ish <<<EOF expression.

    PHP is shit, not because of what it copied from Perl, but because of
    what it didn’t copy. Nowadays it is trying to copy from Python, and it
    is making the same mistake.

    The <<EOD construct that Perl has comes from POSIX shells, and it is very useful in both places. Bash also adds a <<<-construct.

    Question: How would you do two separate <<-strings in the same shell
    command?

  • From Kaz Kylheku@21:1/5 to Lawrence D'Oliveiro on Mon Feb 26 20:42:42 2024
    On 2024-02-24, Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
    I like using this for long strings:

    fputs
    (
    "When an uncleft or a bulkbit wins one or more bernstonebits above\n"
    "its own, it takes on a backward lading. When it loses one or\n"
    "more, it takes on a forward lading. Such a mote is called a\n"
    "*farer*, for that the drag between unlike ladings flits it. When\n"
    "bernstonebits flit by themselves, it may be as a bolt of\n"
    "lightning, a spark off some faststanding chunk, or the everyday\n"
    "flow of bernstoneness through wires.\n",
    stdout
    );

    Of languages that derive ideas from C, only C++ and Python seem to have
    kept this. Java, JavaScript and PHP have not, for some reason.

    Implicit string catenation means you need punctuation to separate
    elements that are not catenated.

    It's a nonstarter in Lisp where you want

    ("ab" "cd" "ef")

    to have three elements, not one. So if it worked that way, we would
    need

    ("ab", "cd", "ef")

    which is too horrible a price to pay for string literal catenation.

    ANSI Lisp just allows line breaks in strings. However, all the
    whitespace, including the line break and any indentation, becomes
    part of the string.

    Allowing line breaks in string literals means that if you forget to
    close a quote, it might not be diagnosed until the end of file!
    The strictness of having to close a string in the same line is
    worthwhile for diagnosis.

    In TXR Lisp, I solved multiple problems with a backslash continuation.

    "abc \
    def"

    encodes the string "abcdef". All unescaped whitespace around the
    backslash is deleted. If you want "abc def", you can plant an escaped
    space in there:

    "abc \ \
    def"

    or

    "abc \
    \ def"

    Unfortunately, it does mean we have the run of backslashes down the
    right side:

    "abc \
    def \
    ghi \
    ... "

    I can live with that.
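
    For comparison, a small sketch of how C itself treats a backslash
    continuation inside a string literal, set against adjacent literals.
    This is an illustration in C, not TXR Lisp, and the names spliced,
    adjacent and same_spelling are made up. Unlike the TXR Lisp behaviour
    described above, C deletes only the backslash and the newline, so any
    indentation on the continued line would end up inside the string.

```c
#include <assert.h>   /* static_assert (C11) */
#include <string.h>

/* Backslash-newline is removed in translation phase 2, before string
   literals are tokenized, so this is the single literal "abc def".
   The continued line starts in column 1 on purpose: whitespace placed
   in front of def would become part of the string. */
static const char spliced[] = "abc \
def";

/* Adjacent literals can be indented freely without changing the text. */
static const char adjacent[] = "abc "
                               "def";

static_assert(sizeof adjacent == sizeof "abc def",
              "adjacent literals spell abc def");

/* Returns 1 when both spellings produce the same string. */
int same_spelling(void)
{
    return strcmp(spliced, adjacent) == 0;
}
```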

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

  • From Mike Sanders@21:1/5 to Lawrence D'Oliveiro on Mon Feb 26 22:03:11 2024
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    I like using this for long strings:

    fputs
    (
    "When an uncleft or a bulkbit wins one or more bernstonebits above\n"
    "its own, it takes on a backward lading. When it loses one or\n"
    "more, it takes on a forward lading. Such a mote is called a\n"
    "*farer*, for that the drag between unlike ladings flits it. When\n"
    "bernstonebits flit by themselves, it may be as a bolt of\n"
    "lightning, a spark off some faststanding chunk, or the everyday\n"
    "flow of bernstoneness through wires.\n",
    stdout
    );

    Of languages that derive ideas from C, only C++ and Python seem to have
    kept this. Java, JavaScript and PHP have not, for some reason.

    Easy solution Lawrence. Why not use something like bin2c:

    <https://www.segger.com/free-utilities/bin2c/>

    void Usage() {

    #include "my_text"

    printf("%s\n", my_var);

    }


    --
    :wq
    Mike Sanders

  • From Lawrence D'Oliveiro@21:1/5 to Mike Sanders on Mon Feb 26 23:17:36 2024
    On Mon, 26 Feb 2024 22:03:11 -0000 (UTC), Mike Sanders wrote:

    Easy solution Lawrence. Why not use something like bin2c:

    My tool for easy editing of such embedded text is the Emacs macros in multiquote.el, here <https://gitlab.com/ldo/emacs-prefs>.

  • From David Brown@21:1/5 to Mike Sanders on Tue Feb 27 09:36:38 2024
    On 26/02/2024 23:03, Mike Sanders wrote:
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    I like using this for long strings:

    fputs
    (
    "When an uncleft or a bulkbit wins one or more bernstonebits above\n"
    "its own, it takes on a backward lading. When it loses one or\n"
    "more, it takes on a forward lading. Such a mote is called a\n"
    "*farer*, for that the drag between unlike ladings flits it. When\n"
    "bernstonebits flit by themselves, it may be as a bolt of\n"
    "lightning, a spark off some faststanding chunk, or the everyday\n"
    "flow of bernstoneness through wires.\n",
    stdout
    );

    Of languages that derive ideas from C, only C++ and Python seem to have
    kept this. Java, JavaScript and PHP have not, for some reason.

    Easy solution Lawrence. Why not use something like bin2c:

    <https://www.segger.com/free-utilities/bin2c/>

    Because it generates files that have Segger copyright notices stamped on
    them? At least, that's how it appears from that web page.

    There are lots of open source alternatives that do similar things, with different variations in the way they generate the output. Or you can
    write your own in about 10 lines of Python, which of course makes it a
    lot easier to customise to fit your own styles and requirements.
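
    The core of such a converter is small in C as well. The following is
    only a sketch under stated assumptions: the function name emit_c_array,
    the output layout (hex values, 12 per line) and the generated
    declaration are all invented here, and do not reflect what bin2c or any
    other existing tool produces.

```c
#include <assert.h>
#include <stdio.h>

/* Writes len bytes from data to out as a C array definition called name.
   Returns 0 on success, -1 if a write fails.
   Limitation of this sketch: len == 0 produces an empty initializer,
   which is not valid before C23. */
int emit_c_array(FILE *out, const char *name,
                 const unsigned char *data, size_t len)
{
    assert(out != NULL && name != NULL);

    if (fprintf(out, "static const unsigned char %s[] = {", name) < 0)
        return -1;
    for (size_t i = 0; i < len; i++) {
        /* Start a new indented line every 12 values for readability. */
        const char *sep = (i % 12 == 0) ? "\n    " : " ";
        if (fprintf(out, "%s0x%02x,", sep, data[i]) < 0)
            return -1;
    }
    if (fprintf(out, "\n};\n") < 0)
        return -1;
    return 0;
}
```

    A caller would read the input file into a buffer, pass it to
    emit_c_array together with stdout or an output file, and then #include
    the generated text.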

    And with C23, we will get #embed, though it is not yet supported by
    major tools.

    <https://en.cppreference.com/w/c/preprocessor/embed>

  • From Janis Papanagnou@21:1/5 to Lawrence D'Oliveiro on Tue Feb 27 13:18:20 2024
    On 26.02.2024 21:31, Lawrence D'Oliveiro wrote:

    The <<EOD construct that Perl has comes from POSIX shells, and it is very useful in both places. Bash also adds a <<<-construct.

    Yes, bash adopted the '<<<' "here-strings".

    Question: How would you do two separate <<-strings in the same shell
    command?

    Can you give an example what you intend here? (With what semantics?)

    Since '<<' is redirecting the here-document text to stdin of the
    command you can have only one channel.

    Janis

  • From Mike Sanders@21:1/5 to David Brown on Tue Feb 27 17:31:36 2024
    David Brown <david.brown@hesbynett.no> wrote:

    Because it generates files that have Segger copyright notices stamped on them? At least, that's how it appears from that web page.

    Then we build our own...

    There are lots of open source alternatives that do similar things, with different variations in the way they generate the output. Or you can
    write your own in about 10 lines of Python, which of course makes it a
    lot easier to customise to fit your own styles and requirements.

    Yeah even simpler ways too, sed/awk/etc

    And with C23, we will get #embed, though it is not yet supported by
    major tools.

    <https://en.cppreference.com/w/c/preprocessor/embed>

    Did not know that was coming down the pike, thanks for sharing the info
    David.


  • From Mike Sanders@21:1/5 to Lawrence D'Oliveiro on Tue Feb 27 17:27:51 2024
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    My tool for easy editing of such embedded text is the Emacs macros in multiquote.el, here <https://gitlab.com/ldo/emacs-prefs>.

    Neato-burritto, built his own tool chain, 'atta-boy. Interesting page.


  • From bart@21:1/5 to David Brown on Tue Feb 27 18:56:26 2024
    On 27/02/2024 08:36, David Brown wrote:
    On 26/02/2024 23:03, Mike Sanders wrote:
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    I like using this for long strings:

         fputs
           (
             "When an uncleft or a bulkbit wins one or more bernstonebits above\n"
             "its own, it takes on a backward lading. When it loses one or\n"
             "more, it takes on a forward lading. Such a mote is called a\n"
             "*farer*, for that the drag between unlike ladings flits it. When\n"
             "bernstonebits flit by themselves, it may be as a bolt of\n"
             "lightning, a spark off some faststanding chunk, or the everyday\n"
             "flow of bernstoneness through wires.\n",
             stdout
           );

    Of languages that derive ideas from C, only C++ and Python seem to have
    kept this. Java, JavaScript and PHP have not, for some reason.

    Easy solution Lawrence. Why not use something like bin2c:

    <https://www.segger.com/free-utilities/bin2c/>

    Because it generates files that have Segger copyright notices stamped on them?  At least, that's how it appears from that web page.

    There are lots of open source alternatives that do similar things, with different variations in the way they generate the output.  Or you can
    write your own in about 10 lines of Python, which of course makes it a
    lot easier to customise to fit your own styles and requirements.

    And with C23, we will get #embed, though it is not yet supported by
    major tools.

    <https://en.cppreference.com/w/c/preprocessor/embed>

    Actually I've had such a feature, for text files, for some years in my
    older compiler:

    #include <stdio.h>

    int main(void) {
    puts(strinclude(__FILE__));
    }

    This prints out the contents of this sourcefile. Binary files don't work because of embedded zeros, but could have been made to.

    Some stuff is just very easy to do; other stuff like designator chains
    less easy and also less useful.

  • From Lawrence D'Oliveiro@21:1/5 to David Brown on Tue Feb 27 20:25:27 2024
    On Tue, 27 Feb 2024 09:36:38 +0100, David Brown wrote:

    And with C23, we will get #embed, though it is not yet supported by
    major tools.

    More and more hacks on the preprocessor. Why not just get rid of it and
    replace it with something like m4?

    Because then you will discover that string-based macros are inherently an unmanageable problem.

  • From bart@21:1/5 to Lawrence D'Oliveiro on Tue Feb 27 22:12:23 2024
    On 27/02/2024 20:25, Lawrence D'Oliveiro wrote:
    On Tue, 27 Feb 2024 09:36:38 +0100, David Brown wrote:

    And with C23, we will get #embed, though it is not yet supported by
    major tools.

    More and more hacks on the preprocessor. Why not just get rid of it and replace it with something like m4?

    Because then you will discover that string-based macros are inherently an unmanageable problem.

    I hadn't noticed that #embed was a preprocessor directive. But that is
    not the problem here, it is this:

    "The expansion of a #embed directive is a token sequence formed from the
    list of integer constant expressions described below."

    If a string like "ABC" really is converted to the five tokens 'A' comma
    'B' comma 'C', then it's going to make long strings and binary files inefficient.

    Embedding a 100KB file will result in a 100KB bigger executable, but
    along the way it may have to generate 200,000 tokens within the
    compiler, half of them commas. Which in turn will need to be turned into 100,000 integer expressions.

    I would hope that implementations find some way of streamlining that
    process, perhaps by turning that 100KB of data directly into a 100KB string.
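
    To make the token arithmetic concrete, here is a small sketch of the
    logical expansion for a hypothetical three-byte file containing ABC in
    ASCII. The file name abc.txt and the helper embed_token_count are
    assumptions made for illustration. Note that the expansion consists of
    integer constants rather than character constants.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>

/* With C23 one would write, for a hypothetical file abc.txt:
 *
 *     static const unsigned char data[] = {
 *     #embed "abc.txt"
 *     };
 *
 * Logically, the directive expands to the comma-separated list below
 * (byte values shown for ASCII). */
static const unsigned char data[] = { 65, 66, 67 };

static_assert(sizeof data == 3, "one array element per byte");

/* Number of tokens in the logical expansion of an n-byte file:
   n integer constants plus n - 1 separating commas. */
size_t embed_token_count(size_t n)
{
    return n == 0 ? 0 : 2 * n - 1;
}
```

    For three bytes this gives the five tokens mentioned above, and for a
    100,000-byte file it gives 199,999, which is where the figure of
    roughly 200,000 comes from.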

  • From David Brown@21:1/5 to bart on Tue Feb 27 23:21:28 2024
    On 27/02/2024 19:56, bart wrote:
    On 27/02/2024 08:36, David Brown wrote:
    On 26/02/2024 23:03, Mike Sanders wrote:
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    I like using this for long strings:

         fputs
           (
             "When an uncleft or a bulkbit wins one or more bernstonebits above\n"
             "its own, it takes on a backward lading. When it loses one or\n"
             "more, it takes on a forward lading. Such a mote is called a\n"
             "*farer*, for that the drag between unlike ladings flits it. When\n"
             "bernstonebits flit by themselves, it may be as a bolt of\n"
             "lightning, a spark off some faststanding chunk, or the everyday\n"
             "flow of bernstoneness through wires.\n",
             stdout
           );

    Of languages that derive ideas from C, only C++ and Python seem to have
    kept this. Java, JavaScript and PHP have not, for some reason.

    Easy solution Lawrence. Why not use something like bin2c:

    <https://www.segger.com/free-utilities/bin2c/>

    Because it generates files that have Segger copyright notices stamped
    on them?  At least, that's how it appears from that web page.

    There are lots of open source alternatives that do similar things,
    with different variations in the way they generate the output.  Or you
    can write your own in about 10 lines of Python, which of course makes
    it a lot easier to customise to fit your own styles and requirements.

    And with C23, we will get #embed, though it is not yet supported by
    major tools.

    <https://en.cppreference.com/w/c/preprocessor/embed>

    Actually I've had such a feature, for text files, for some years in my
    older compiler:

        #include <stdio.h>

        int main(void) {
            puts(strinclude(__FILE__));
        }

    This prints out the contents of this sourcefile. Binary files don't work because of embedded zeros, but could have been made to.

    Some stuff is just very easy to do; other stuff like designator chains
    less easy and also less useful.

    The #embed pre-processor directive turns the file into a list of integer constants, one per byte (unless an implementation offers other options).
    That makes it a little less convenient for strings than your solution,
    but more convenient for data files. There's no harm in supporting both!

  • From Lawrence D'Oliveiro@21:1/5 to Janis Papanagnou on Tue Feb 27 23:10:17 2024
    On Tue, 27 Feb 2024 13:18:20 +0100, Janis Papanagnou wrote:

    On 26.02.2024 21:31, Lawrence D'Oliveiro wrote:

    Question: How would you do two separate <<-strings in the same shell
    command?

    Can you give an example what you intend here? (With what semantics?)

    Since '<<' is redirecting the here-document text to stdin of the command
    you can have only one channel.

    Perl lets you do something like

    func(<<EOD1, <<EOD2);
    ... contents of first string ...
    EOD1
    ... contents of second string ...
    EOD2

    But this doesn’t work in Bash. However, in a Posix shell, remember you can specify the number of the file descriptor you want to redirect, e.g.

    diff -u /dev/fd/8 /dev/fd/9 8<<'EOD1' 9<<'EOD2'
    ... contents of first string ...
    EOD1
    ... contents of second string ...
    EOD2

    Note I add the single quotes to prevent expansion of “$”-sequences within the strings. (I think this might be needed in Perl, too.)

  • From Lawrence D'Oliveiro@21:1/5 to Keith Thompson on Tue Feb 27 23:03:08 2024
    On Tue, 27 Feb 2024 12:35:35 -0800, Keith Thompson wrote:

    (m4? Seriously?)

    Do you know of any more powerful string-based macro processor?

    The C preprocessor operates on preprocessor tokens, not just strings.

    Think of “hygienic” macros in the Lisps, and why that is just impossible
    in any string-based preprocessor.

  • From Lawrence D'Oliveiro@21:1/5 to David Brown on Tue Feb 27 22:52:33 2024
    On Tue, 27 Feb 2024 23:21:28 +0100, David Brown wrote:

    The #embed pre-processor directive turns the file into a list of integer constants, one per byte (unless an implementation offers other options).

    What a waste of time.

  • From Janis Papanagnou@21:1/5 to Lawrence D'Oliveiro on Wed Feb 28 00:50:46 2024
    XPost: comp.unix.shell

    On 28.02.2024 00:10, Lawrence D'Oliveiro wrote:
    On Tue, 27 Feb 2024 13:18:20 +0100, Janis Papanagnou wrote:

    On 26.02.2024 21:31, Lawrence D'Oliveiro wrote:

    Question: How would you do two separate <<-strings in the same shell
    command?

    Can you give an example what you intend here? (With what semantics?)

    Since '<<' is redirecting the here-document text to stdin of the command
    you can have only one channel.

    Perl lets you do something like

    func(<<EOD1, <<EOD2);
    ... contents of first string ...
    EOD1
    ... contents of second string ...
    EOD2

    But this doesn’t work in Bash. However, in a Posix shell, remember you can specify the number of the file descriptor you want to redirect, e.g.

    diff -u /dev/fd/8 /dev/fd/9 8<<'EOD1' 9<<'EOD2'
    ... contents of first string ...
    EOD1
    ... contents of second string ...
    EOD2

    Note I add the single quotes to prevent expansion of “$”-sequences within the strings. (I think this might be needed in Perl, too.)

    I see. - Yes, you can do that in POSIX shells as well. - Note that I set F'up-to CUS. And post the response there as a f'up to this post.

    Janis

  • From Kaz Kylheku@21:1/5 to Lawrence D'Oliveiro on Wed Feb 28 01:09:46 2024
    On 2024-02-27, Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
    On Tue, 27 Feb 2024 23:21:28 +0100, David Brown wrote:

    The #embed pre-processor directive turns the file into a list of integer
    constants, one per byte (unless an implementation offers other options).

    What a waste of time.

    Plus easily doable in 1970's Lisp.


  • From David Brown@21:1/5 to bart on Wed Feb 28 12:54:06 2024
    On 27/02/2024 23:12, bart wrote:
    On 27/02/2024 20:25, Lawrence D'Oliveiro wrote:
    On Tue, 27 Feb 2024 09:36:38 +0100, David Brown wrote:

    And with C23, we will get #embed, though it is not yet supported by
    major tools.

    More and more hacks on the preprocessor. Why not just get rid of it and
    replace it with something like m4?

    Because then you will discover that string-based macros are inherently an
    unmanageable problem.

    I hadn't noticed that #embed was a preprocessor directive. But that is
    not the problem here, it is this:

    "The expansion of a #embed directive is a token sequence formed from the
    list of integer constant expressions described below."

    If a string like "ABC" really is converted to the five tokens 'A' comma
    'B' comma 'C', then it's going to make long strings and binary files inefficient.

    Embedding a 100KB file will result in a 100KB bigger executable, but
    along the way it may have to generate 200,000 tokens within the
    compiler, half of them commas. Which in turn will need to be turned into 100,000 integer expressions.

    I would hope that implementations find some way of streamlining that
    process, perhaps by turning that 100KB of data directly into a 100KB
    string.

    They won't use strings, they will use data blobs - binary data. Then
    there is no issue with null bytes. And yes, implementations will skip
    the token generation (unless you are doing something weird, such as
    using #embed to read the parameters to a function call).

    Tests with prototype implementations gave extremely fast results.

  • From David Brown@21:1/5 to Kaz Kylheku on Wed Feb 28 12:50:10 2024
    On 28/02/2024 02:09, Kaz Kylheku wrote:
    On 2024-02-27, Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
    On Tue, 27 Feb 2024 23:21:28 +0100, David Brown wrote:

    The #embed pre-processor directive turns the file into a list of integer
    constants, one per byte (unless an implementation offers other options).

    What a waste of time.

    Plus easily doable in 1970's Lisp.



    That would be useful, if we were living in the 1970's or if anyone had
    wanted to learn Lisp this side of the millennium bug.

    As I mentioned before, it's not particularly difficult to do this kind
    of manipulation, and people write utilities for them in a variety of
    languages, or download a variety of free tools for the job.

    But it will often be more convenient to have it built into the language
    and compiler. And for those interested in speed, the test
    implementations have handled this far more efficiently than other
    techniques. Logically, #embed turns the file into a list of numbers. In practice, if you use it for the common case of initialising a const
    array of unsigned char, the compiler simply copies and pastes the file
    into the output as a binary blob.

    It would, IMHO, have been useful also to have had an "embed operator" in
    the manner of the "pragma operator", so that it could be used in a macro definition.

  • From bart@21:1/5 to David Brown on Wed Feb 28 13:13:13 2024
    On 28/02/2024 11:54, David Brown wrote:
    On 27/02/2024 23:12, bart wrote:
    On 27/02/2024 20:25, Lawrence D'Oliveiro wrote:
    On Tue, 27 Feb 2024 09:36:38 +0100, David Brown wrote:

    And with C23, we will get #embed, though it is not yet supported by
    major tools.

    More and more hacks on the preprocessor. Why not just get rid of it and
    replace it with something like m4?

    Because then you will discover that string-based macros are inherently
    an unmanageable problem.

    I hadn't noticed that #embed was a preprocessor directive. But that is
    not the problem here, it is this:

    "The expansion of a #embed directive is a token sequence formed from
    the list of integer constant expressions described below."

    If a string like "ABC" really is converted to the five tokens 'A'
    comma 'B' comma 'C', then it's going to make long strings and binary
    files inefficient.

    Embedding a 100KB file will result in a 100KB bigger executable, but
    along the way it may have to generate 200,000 tokens within the
    compiler, half of them commas. Which in turn will need to be turned
    into 100,000 integer expressions.

    I would hope that implementations find some way of streamlining that
    process, perhaps by turning that 100KB of data directly into a 100KB
    string.

    They won't use strings, they will use data blobs - binary data.  Then
    there is no issue with null bytes.

    AFAIK strings in C can have embedded zeros when not assumed to be zero-terminated. So here:

    char s[]={1,2,3,0,4,5,6};

    s will have a length of 7.
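
    A short sketch of the two different "lengths" in play here, using the
    same array. The helper names array_size and string_length are made up
    for illustration.

```c
#include <assert.h>   /* static_assert (C11) */
#include <string.h>

/* The array from the post: seven elements, one of them zero. */
static const char s[] = {1, 2, 3, 0, 4, 5, 6};

/* The array size counts every element, including the embedded zero. */
static_assert(sizeof s == 7, "seven chars in the array");

/* Size of the array object in bytes. */
size_t array_size(void)
{
    return sizeof s;
}

/* Length as seen by the string functions, which stop at the first zero. */
size_t string_length(void)
{
    return strlen(s);
}
```

    So sizeof reports 7, while strlen and every other function that treats
    the array as a string see only the first three characters.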

    And yes, implementations will skip
    the token generation (unless you are doing something weird, such as
    using #embed to read the parameters to a function call).

    What happens if you do -E to preprocess only?

  • From David Brown@21:1/5 to bart on Wed Feb 28 15:08:52 2024
    On 28/02/2024 14:13, bart wrote:
    On 28/02/2024 11:54, David Brown wrote:
    On 27/02/2024 23:12, bart wrote:
    On 27/02/2024 20:25, Lawrence D'Oliveiro wrote:
    On Tue, 27 Feb 2024 09:36:38 +0100, David Brown wrote:

    And with C23, we will get #embed, though it is not yet supported by
    major tools.

    More and more hacks on the preprocessor. Why not just get rid of it and
    replace it with something like m4?

    Because then you will discover that string-based macros are
    inherently an
    unmanageable problem.

    I hadn't noticed that #embed was a preprocessor directive. But that is
    not the problem here, it is this:

    "The expansion of a #embed directive is a token sequence formed from
    the list of integer constant expressions described below."

    If a string like "ABC" really is converted to the five tokens 'A'
    comma 'B' comma 'C', then it's going to make long strings and binary
    files inefficient.

    Embedding a 100KB file will result in a 100KB bigger executable, but
    along the way it may have to generate 200,000 tokens within the
    compiler, half of them commas. Which in turn will need to be turned
    into 100,000 integer expressions.

    I would hope that implementations find some way of streamlining that
    process, perhaps by turning that 100KB of data directly into a 100KB
    string.

    They won't use strings, they will use data blobs - binary data.  Then
    there is no issue with null bytes.

    AFAIK strings in C can have embedded zeros when not assumed to be zero-terminated. So here:

        char s[]={1,2,3,0,4,5,6};

    s will have a length of 7.

    That's not a string, it's an array of char. A "string" in C is "a
    contiguous sequence of characters terminated by and including the first
    null character". The difference is crucial in respect to the handling
    of null bytes. And it is the main reason for #embed generating a
    comma-separated sequence of integer constants rather than a string. (It
    also avoids messy hex character sequences if you show the output of
    #embed somewhere.)
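
    As a small illustration of the distinction (a sketch only; the helper
    names array_size and string_length are made up for this example), sizeof
    sees all seven elements of the array, while strlen only sees the string
    that ends at the first null character:

```c
#include <stddef.h>
#include <string.h>

/* The array from the post: seven chars with a zero in the middle. */
static const char s[] = {1, 2, 3, 0, 4, 5, 6};

/* Number of elements in the array object (hypothetical helper). */
size_t array_size(void)
{
    return sizeof s;
}

/* Length of the string starting at s[0]; strlen stops at the first null,
   which is safe here because the array does contain one. */
size_t string_length(void)
{
    return strlen(s);
}
```

    Here array_size() gives 7 and string_length() gives 3: the array holds
    a string in its first four elements, followed by three more chars that
    are not part of any string.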


      And yes, implementations will skip the token generation (unless you
    are doing something weird, such as using #embed to read the parameters
    to a function call).

    What happens if you do -E to preprocess only?


    That's something weird :-)

    I guess you get the integer list.

  • From Lawrence D'Oliveiro@21:1/5 to David Brown on Wed Feb 28 20:56:28 2024
    On Wed, 28 Feb 2024 12:50:10 +0100, David Brown wrote:

    ... people write utilities for them in a variety of languages ...

    But it will often be more convenient to have it built into the language
    and compiler.

    What can be built into the language can only ever be a small subset of
    the many and varied ways that people have incorporated data blobs into
    their programs. Often these will need to have custom structures with
    computed header fields, that kind of thing. So you will need custom
    build tools to construct these structures, and then you might as well
    include those blobs directly into the final build, rather than go
    through some extra step of pretending to turn them back into some
    source form.

    For example, here’s an old Android project of mine (OK, so the app is
    Java code, but the same principle applies) <https://bitbucket.org/ldo17/unicode_browser_android/src/master/>
    where I wrote a custom Python script to read a Nameslist.txt file
    downloaded from unicode.org to generate a table which could be loaded
    into memory quickly for easy searching.

  • From bart@21:1/5 to Lawrence D'Oliveiro on Wed Feb 28 21:34:14 2024
    On 28/02/2024 20:56, Lawrence D'Oliveiro wrote:
    On Wed, 28 Feb 2024 12:50:10 +0100, David Brown wrote:

    ... people write utilities for them in a variety of languages ...

    But it will often be more convenient to have it built into the language
    and compiler.

    What can be built into the language can only ever be a small subset of
    the many and varied ways that people have incorporated data blobs into
    their programs. Often these will need to have custom structures with
    computed header fields, that kind of thing. So you will need custom
    build tools to construct these structures, and then you might as well
    include those blobs directly into the final build, rather than go
    through some extra step of pretending to turn them back into some
    source form.

    For example, here’s an old Android project of mine (OK, so the app is
    Java code, but the same principle applies) <https://bitbucket.org/ldo17/unicode_browser_android/src/master/>
    where I wrote a custom Python script to read a Nameslist.txt file
    downloaded from unicode.org to generate a table which could be loaded
    into memory quickly for easy searching.

    I can see now where you get your coding style from. You seem to like
    stretching things out vertically as much as possible:

    public void Add
    (
    int CategoryCode,
    ItemType Item
    )
    /* Use this instead of add to populate CodeToIndex table. */
    {
    CodeToIndex.put(CategoryCode, getCount());
    add(Item);
    } /*Add*/

    In C:

    void Add(int CategoryCode, ItemType Item) {
    CodeToIndex_put(CategoryCode, getCount());
    add(Item);
    }

    4 non-comment lines versus 9. I know Java needs tons of boilerplate,
    but it is not all the language's fault.

  • From bart@21:1/5 to Keith Thompson on Wed Feb 28 23:01:22 2024
    On 28/02/2024 21:57, Keith Thompson wrote:
    David Brown <david.brown@hesbynett.no> writes:
    [...]
    They won't use strings, they will use data blobs - binary data. Then
    there is no issue with null bytes. And yes, implementations will skip
    the token generation (unless you are doing something weird, such as
    using #embed to read the parameters to a function call).

    Tests with prototype implementations gave extremely fast results.

    I'm not sure how that would work. #embed is a preprocessor directive,
    and at least in the abstract model it has to expand to valid C code.

    I would have expected that it would simply generate the list of comma-separated integer constants described in the standard; later
    phases would simply parse that list and generate code as if that
    sequence had been written in the original source file. Do you know of
    an implementation that does something else?

    For example, say you have a file "foo.dat" containing 4 bytes with
    values 0, 1, 2, and 3. This would be perfectly valid:

    struct foo {
    unsigned char a;
    unsigned short b;
    unsigned int c;
    double d;
    };

    struct foo obj = {
    #embed "foo.dat"
    };

    It would be unfortunate if your example was allowed. Clearly a binary representation of an instance of your struct would probably require 16
    bytes rather than 4, of which one may be padding.

    Certainly if you were to write it out to disk as binary, it would need
    more than 4.


    #embed isn't defined to translate an input file to a sequence of bytes.
    It's defined to translate an input file to a sequence of integer
    constant expressions.

    Maybe it should be defined exactly like that, because that is what
    people might expect. Your example is better off using a normal text file
    which contains an actual comma-delimited list (and which can mix ints
    and floats), and a regular #include.
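
    For reference, under the integer-list model quoted above, and assuming
    foo.dat really holds the four bytes 0, 1, 2, 3, the initialiser behaves
    as if those values had been typed in by hand, one per member. A sketch
    (foo_matches_expected is a hypothetical helper, not part of any API):

```c
struct foo {
    unsigned char a;
    unsigned short b;
    unsigned int c;
    double d;
};

/* Hand-written stand-in for the expansion of `#embed "foo.dat"`,
   assuming the file contains the bytes 0, 1, 2, 3. Each integer
   initialises one member and is converted to that member's type. */
static const struct foo obj = { 0, 1, 2, 3 };

/* Returns 1 if every member received the corresponding byte value. */
int foo_matches_expected(void)
{
    return obj.a == 0 && obj.b == 1 && obj.c == 2 && obj.d == 3.0;
}
```

    So the four bytes are not copied over the object representation of the
    struct; they become four values, which is why the size of the struct
    does not matter to the validity of the example.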

  • From Lawrence D'Oliveiro@21:1/5 to bart on Wed Feb 28 23:52:55 2024
    On Wed, 28 Feb 2024 21:34:14 +0000, bart wrote:

    In C:

    void Add(int CategoryCode, ItemType Item) {
    CodeToIndex_put(CategoryCode, getCount());
    add(Item);
    }

    4 non-comment lines versus 9. I know Java needs tons of boilerplate,
    but it is not all the language's fault.

    Or how about

    void Add(int CategoryCode, ItemType Item) {CodeToIndex_put(CategoryCode, getCount());add(Item);}

    Wow! I never realized you could do that in C!! I thought it was an
    error to put stuff after column 72 or something. Thanks for the tip!!!

  • From bart@21:1/5 to Keith Thompson on Thu Feb 29 00:47:25 2024
    On 28/02/2024 23:31, Keith Thompson wrote:
    bart <bc@freeuk.com> writes:

    It would be unfortunate if your example was allowed. Clearly a binary
    representation of an instance of your struct would probably require 16
    bytes rather than 4, of which one may be padding.

    Depending on the sizes and alignments of the various types, sure.
    So what?


    If you have suggestions for alternate ways to define #embed, they might
    be interesting, but it's too late to change the existing specification.


    My early comments on this were about compiler performance. I suggested
    there might be a way to turn 100,000 byte values in a file, directly
    into a 100KB string or data block, without needing to first convert
    100,000 values into 100,000 integer expressions represented as tokens,
    and to then parse those 100,000 expressions into AST nodes etc.

    DB suggested something like that was actually done. But you can't do
    that if those 100,000 numbers represent from 100KB to 800KB of memory
    depending on the data type of the structure they're initialising.

    They might even be mixed type. Or it might be an example like this:

    A binary file contains 8 bytes representing one IEEE754 float value. It
    is desired to use that to initialise a double array of one element.

    However #embed will turn that into 8 integer values of 0 to 255 each (I assume).

    It's not clear either what happens when one of the integers has the
    value 150, say, but it is used to initialise an element of type (signed)
    char. It sounds like it would make it hard to initialise a char[] array,
    when char is signed, from a file of UTF8 text.
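
    The float case can be sketched in plain C without #embed. The byte
    values below are what 1.0 looks like on a little-endian IEEE 754
    machine (an assumption, as are all the helper names); used as an
    initialiser list they produce eight doubles, and getting one double out
    of the raw bytes takes an explicit copy:

```c
#include <stddef.h>
#include <string.h>

/* A byte list used as an initialiser: each integer becomes its own
   double, so this array has eight elements, not one. */
static const double as_list[] = {0, 0, 0, 0, 0, 0, 240, 63};

size_t list_element_count(void)
{
    return sizeof as_list / sizeof as_list[0];
}

/* Copy the object representation of a double out into raw bytes. */
void bytes_from_double(double d, unsigned char raw[sizeof(double)])
{
    memcpy(raw, &d, sizeof d);
}

/* Rebuild a double from raw bytes in the machine's own format. */
double double_from_bytes(const unsigned char raw[sizeof(double)])
{
    double d;
    memcpy(&d, raw, sizeof d);
    return d;
}
```

    list_element_count() is 8 here, while passing the bytes produced by
    bytes_from_double(1.0, raw) to double_from_bytes gives back 1.0 on any
    machine, since the round trip does not depend on byte order.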

    Basically, #embed is dumb.

    For flexibility, I wouldn't use #embed at all. Just have an actual
    comma-separated set of values in a text file, and use #include instead.

  • From bart@21:1/5 to Lawrence D'Oliveiro on Thu Feb 29 00:15:17 2024
    On 28/02/2024 23:52, Lawrence D'Oliveiro wrote:
    On Wed, 28 Feb 2024 21:34:14 +0000, bart wrote:

    In C:

    void Add(int CategoryCode, ItemType Item) {
    CodeToIndex_put(CategoryCode, getCount());
    add(Item);
    }

    4 non-comment lines versus 9. I know Java needs tons of boilerplate,
    but it is not all the language's fault.

    Or how about

    void Add(int CategoryCode, ItemType Item) {CodeToIndex_put(CategoryCode, getCount());add(Item);}

    Wow! I never realized you could do that in C!! I thought it was an
    error to put stuff after column 72 or something. Thanks for the tip!!!

    Well, you could write an entire program on one line.

    Or you can write an entire program in one thin column:

    v\
    o\
    i\
    d\
    ....

    Or you can use common sense and avoid writing code which is either
    too compact or so spread out vertically that you have to hunt for the
    actual code. Like trying to find the bits of meat in a thin soup.

    That's what I took away from your Java code, which looks remarkably like
    the spaced-out examples you posted recently.

  • From Lawrence D'Oliveiro@21:1/5 to bart on Thu Feb 29 02:53:33 2024
    On Thu, 29 Feb 2024 00:15:17 +0000, bart wrote:

    Or you can use common sense and avoid writing code which is either
    too compact or so spread out vertically that you have to hunt for the
    actual code. Like trying to find the bits of meat in a thin soup.

    Terribly sorry about that. I wonder if you could look at this part of the
    same code file:

    final android.util.SparseArray<Integer> CodeToIndex =
    new android.util.SparseArray<Integer>();

    and show me how to thicken that part of my humble, tasteless gruel? Maybe
    using that same “_” trick you used to do OO in C in your previous example?

  • From David Brown@21:1/5 to Lawrence D'Oliveiro on Thu Feb 29 08:58:40 2024
    On 28/02/2024 21:56, Lawrence D'Oliveiro wrote:
    On Wed, 28 Feb 2024 12:50:10 +0100, David Brown wrote:

    ... people write utilities for them in a variety of languages ...

    But it will often be more convenient to have it built into the language
    and compiler.

    What can be built into the language can only ever be a small subset of
    the many and varied ways that people have incorporated data blobs into
    their programs.

    Of course. But that doesn't mean that a language should not include a
    feature that makes it easy for a lot of people to get some data blobs
    into their code. Maybe /you/ won't find it useful, but other people will.

  • From David Brown@21:1/5 to Keith Thompson on Thu Feb 29 10:10:10 2024
    On 28/02/2024 22:57, Keith Thompson wrote:
    David Brown <david.brown@hesbynett.no> writes:
    [...]
    They won't use strings, they will use data blobs - binary data. Then
    there is no issue with null bytes. And yes, implementations will skip
    the token generation (unless you are doing something weird, such as
    using #embed to read the parameters to a function call).

    Tests with prototype implementations gave extremely fast results.

    I'm not sure how that would work. #embed is a preprocessor directive,
    and at least in the abstract model it has to expand to valid C code.

    I would have expected that it would simply generate the list of comma-separated integer constants described in the standard; later
    phases would simply parse that list and generate code as if that
    sequence had been written in the original source file. Do you know of
    an implementation that does something else?


    The key thing, as I understand it, is that the compiler gets to know
    that the integers in the list are all "nice". And since the
    preprocessor and the compiler are part of the same implementation (even
    if they are separate programs communicating with pipes or temporary
    files), the preprocessor could pass on the binary blob in a pre-parsed form.

    Think about what a preprocessor and compiler does with the initialisers
    in an array, written in normal text (such as by using "xxd -i" or
    another external script). For each integer, it has to divide up the
    tokens, identify the comma, parse the integer, check that it is a valid
    integer, figure out its type based on the size (and suffix, if any). It
    needs to record the line number and column number for possible later
    reference in error or warning messages. It has to check the value of
    the integer against the type for the array elements, and possibly change
    the value to suit, or issue warnings for out-of-range values. It has to
    allocate all the space to store this information as it goes along,
    without knowing the size of the array - so it will be lots of small
    mallocs and/or wasted space. It's a /lot/. (Simpler compilers can get
    away with a bit less effort, especially if they have more limited warnings.)

    With #embed, the preprocessor can generate a compiler-specific "start of
    embed" informational directive (much like "#line" directives and such
    things generated by preprocessors today), then the data in a very
    specific format, then an "end of embed" directive. It could, for
    example, generate all the integers in the format "0xAB, " with 16
    elements per line. The compiler wouldn't need to parse the data
    normally - it knows exactly how many elements there are (from the "start
    of embed" directive), it knows exactly where to find each entry (as each
    is 6 characters long), it only needs to look at two of these characters,
    there's never any errors, the source line number is fixed (at the #embed
    line), and so on.
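
    A rough sketch of that fixed-format idea, to show why the reading side
    is cheap. This is not how any particular compiler does it; the names
    blob_write, blob_read, blob_text_size and the exact layout ("0xAB, "
    cells, 16 per line, upper-case hex) are assumptions for illustration:

```c
#include <stddef.h>

#define CELL     6   /* length of one "0xAB, " cell */
#define PER_LINE 16  /* cells per line of text */

/* Offset of cell i: six chars per cell plus one '\n' per completed line. */
static size_t cell_offset(size_t i)
{
    return i * CELL + i / PER_LINE;
}

/* Buffer size needed for n elements, including the terminating null. */
size_t blob_text_size(size_t n)
{
    return cell_offset(n) + 1;
}

/* "Preprocessor" side: emit the bytes in the rigid textual format. */
void blob_write(const unsigned char *data, size_t n, char *out)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t pos = 0;
    for (size_t i = 0; i < n; i++) {
        out[pos++] = '0';
        out[pos++] = 'x';
        out[pos++] = hex[data[i] >> 4];
        out[pos++] = hex[data[i] & 0x0F];
        out[pos++] = ',';
        out[pos++] = ' ';
        if ((i + 1) % PER_LINE == 0)
            out[pos++] = '\n';
    }
    out[pos] = '\0';
}

/* Value of one upper-case hex digit; the format guarantees validity. */
static unsigned hex_value(char c)
{
    return (c >= 'A') ? (unsigned)(c - 'A' + 10) : (unsigned)(c - '0');
}

/* "Compiler" side: no tokenising, just two characters per element
   at a position computed directly from the element index. */
void blob_read(const char *text, size_t n, unsigned char *out)
{
    for (size_t i = 0; i < n; i++) {
        const char *cell = text + cell_offset(i);
        out[i] = (unsigned char)((hex_value(cell[2]) << 4) | hex_value(cell[3]));
    }
}
```

    Because the element count and layout are known up front, blob_read
    needs one allocation of known size and no per-element validation,
    which is the saving being described.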

    A more tightly coupled preprocessor and compiler can do even better -
    for array initialisation, the binary blob could be used directly without
    ever generating integer constants or parsing them.

    The results of testing are that #embed is /massively/ faster and lower
    memory compared to external generators, especially for larger files.
    And it gives you the data on-hand for optimisation purposes, unlike
    external direct linking of binary blobs. (So you can get the size of
    the array, or use values from it as compile-time known values.)

    For example, say you have a file "foo.dat" containing 4 bytes with
    values 0, 1, 2, and 3. This would be perfectly valid:

    struct foo {
    unsigned char a;
    unsigned short b;
    unsigned int c;
    double d;
    };

    struct foo obj = {
    #embed "foo.dat"
    };

    #embed isn't defined to translate an input file to a sequence of bytes.
    It's defined to translate an input file to a sequence of integer
    constant expressions.


    Yes. But the prime speed (and memory usage) gains are for
    large files, and that means array initialisers. That does not conflict
    with using it for cases like yours.

    *Maybe* a compiler could optimize for the case where it knows that it's
    being used to initialize an array of unsigned char, but (a) that would
    require the preprocessor to have information that normally doesn't exist
    until later phases, and (b) I'm not convinced it would be worth the
    effort.


    Look at <https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p1040r6.html#design-practice-speed>.

    In those tests, for a 40 MB file gcc #embed is 200 times faster than
    "xxd -i" generated files, and takes about 2.5% of the memory. It scales
    to 1 GB files. And that's just a proof-of-concept implementation.

  • From bart@21:1/5 to Lawrence D'Oliveiro on Thu Feb 29 09:20:30 2024
    On 29/02/2024 02:53, Lawrence D'Oliveiro wrote:
    On Thu, 29 Feb 2024 00:15:17 +0000, bart wrote:

    Or you can use common sense and avoid writing code which is either
    too compact or so spread out vertically that you have to hunt for the
    actual code. Like trying to find the bits of meat in a thin soup.

    Terribly sorry about that. I wonder if you could look at this part of the same code file:

    final android.util.SparseArray<Integer> CodeToIndex =
    new android.util.SparseArray<Integer>();

    and show me how to thicken that part of my humble, tasteless gruel? Maybe using that same “_” trick you used to do OO in C in your previous example?

    You've shown an example of a piece of meat. In main.java, 70% of the
    lines are either blanks or contain only an opening or closing bracket.

  • From bart@21:1/5 to David Brown on Thu Feb 29 10:18:21 2024
    On 29/02/2024 09:10, David Brown wrote:
    On 28/02/2024 22:57, Keith Thompson wrote:

    *Maybe* a compiler could optimize for the case where it knows that it's
    being used to initialize an array of unsigned char, but (a) that would
    require the preprocessor to have information that normally doesn't exist
    until later phases, and (b) I'm not convinced it would be worth the
    effort.


    Look at <https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p1040r6.html#design-practice-speed>.

    In those tests, for a 40 MB file gcc #embed is 200 times faster than
    "xxd -i" generated files, and takes about 2.5% of the memory.  It scales
    to 1 GB files.  And that's just a proof-of-concept implementation.

    I've just done my own tests, with a 40MB data file containing random
    A..Z letters (so can be processed as a text file).

    This was also converted to a 120MB text file containing a list of numbers
    ("65,66,73,...", 3 characters for each data byte).

    Using 'strinclude' in my old C compiler, it took about 1 second to build
    this program:

    #include <stdio.h>
    #include <string.h>

    char* s=strinclude("data");

    int main(void) {
    printf("%zu\n", strlen(s));
    }

    (Running it shows '40000000'.) The same test in my language (which has
    no intermediate ASM stage) took 0.3 seconds.

    Next I tried instead that 120MB text file containing the same data but
    as text, initialising a char[] array using #include.

    Tcc took 12 seconds. Bcc took 56 seconds (via ASM etc).

    gcc got up to about 3GB memory usage then 'cc1' failed trying to
    allocate 0.5GB, after about a minute.

    Processing a long list of numbers DOES use considerable resources. Bear in
    mind that #embed also needs to take binary data and generate tokens,
    possibly converting each binary number to text.

  • From bart@21:1/5 to Keith Thompson on Thu Feb 29 14:31:11 2024
    On 28/02/2024 21:36, Keith Thompson wrote:
    bart <bc@freeuk.com> writes:
    [...]
    AFAIK strings in C can have embedded zeros when not assumed to be
    zero-terminated. So here:

    char s[]={1,2,3,0,4,5,6};

    s will have a length of 7.

    Strings *by definition* cannot have embedded zeros. A null character terminates a string.

    A string literal can have embedded \0 characters, but if you're
    suggesting that #embed should expand to a string literal, I can see
    several disadvantages and no significant advantages. For one thing, the
    data may or may not end with a null character; string literals always
    do.

    Not here:

    char s[] = "ABC";
    char t[3] = "DEF";

    The "DEF" string doesn't end with a zero.

    Is 'string' given a special meaning in the standard?

    /That/ would seem to me to be too restrictive. Does this:

    char *s;

    define a pointer to such a string, or can it be any kind of data? For
    example, `char*` is used by the GetOpenFileName WinAPI function for a
    /series/ of zero-terminated strings which itself is terminated with two
    zero bytes.

    So it is some property that is attributed to the data that will be stored.

    I normally use `cstring` or `stringz` outside the language when referring
    to a zero-terminated sequence of characters, which implies that
    embedded zeros aren't allowed.

  • From Richard Harnden@21:1/5 to bart on Thu Feb 29 15:22:18 2024
    On 29/02/2024 14:31, bart wrote:
    On 28/02/2024 21:36, Keith Thompson wrote:
    bart <bc@freeuk.com> writes:
    [...]
    AFAIK strings in C can have embedded zeros when not assumed to be
    zero-terminated. So here:

         char s[]={1,2,3,0,4,5,6};

    s will have a length of 7.

    Strings *by definition* cannot have embedded zeros.  A null character
    terminates a string.

    A string literal can have embedded \0 characters, but if you're
    suggesting that #embed should expand to a string literal, I can see
    several disadvantages and no significant advantages.  For one thing, the
    data may or may not end with a null character; string literals always
    do.

    Not here:

        char s[]  = "ABC";
        char t[3] = "DEF";

    The "DEF" string doesn't end with a zero.

    And is, therefore, not a string.


    Is 'string' given a special meaning in the standard?

    Yes. Things that work with the strX functions. Which means they are
    '\0' terminated.


    /That/ would seem to me to be too restrictive. Does this:

       char *s;

    define a pointer to such a string, or can it be any kind of data? For

    That points to a char. That could be followed by more chars and if one
    of those is a '\0', then it's a string. You know this.


    example, `char*` is used by the GetOpenFileName WinAPI function for a
    /series/ of zero-terminated strings which itself is terminated with two
    zero bytes.

    That is a windowsism, then.

    Why didn't they use the NULL terminated char **argv kind of thing?


    So it is some property that is attributed to the data that will be stored.

    I normally use `cstring` or `stringz` outside the language when referring
    to a zero-terminated sequence of characters, which implies that
    embedded zeros aren't allowed.





  • From David Brown@21:1/5 to Malcolm McLean on Thu Feb 29 16:19:45 2024
    On 29/02/2024 12:56, Malcolm McLean wrote:
    On 28/02/2024 21:36, Keith Thompson wrote:
    bart <bc@freeuk.com> writes:
    [...]
    AFAIK strings in C can have embedded zeros when not assumed to be
    zero-terminated. So here:

         char s[]={1,2,3,0,4,5,6};

    s will have a length of 7.

    Strings *by definition* cannot have embedded zeros.  A null character
    terminates a string.

    C strings. Not strings in other programming languages.

    Let me point you to the name of this Usenet group.

    And strings in any programming language have either :

    1. A string of characters and a terminating null, thus no embedded null characters.
    2. A starting length (such as Pascal strings).
    3. A fixed size.
    4. A more advanced structure.

    An array of bytes is not a "string".

    And only if you define "C strings" in a rather restrictive but, to be
    fair, totally legitimate way. So I wouldn't have put in the asterisks.


    The definition of "C string" is given in section 7.1.1p1 of the C
    standards. There is only one definition of "C string".

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From tTh@21:1/5 to bart on Thu Feb 29 16:29:24 2024
    On 2/29/24 01:47, bart wrote:

    My early comments on this were about compiler performance. I suggested
    there might be a way to turn 100,000 byte values in a file, directly
    into a 100KB string or data block, without needing to first convert
    100,000 values into 100,000 integer expressions represented as tokens,
    and to then parse those 100,000 expressions into AST nodes etc.

    But you HAVE to do that if #embed is in the preprocessor,
    because its job is to give compilable text to the real
    compiler. No other way is possible.

    Basically, #embed is dumb.

    No.

    --
    +---------------------------------------------------------------------+
    | https://tube.interhacker.space/a/tth/video-channels                 |
    +---------------------------------------------------------------------+

  • From David Brown@21:1/5 to bart on Thu Feb 29 16:30:05 2024
    On 29/02/2024 15:31, bart wrote:
    On 28/02/2024 21:36, Keith Thompson wrote:
    bart <bc@freeuk.com> writes:
    [...]
    AFAIK strings in C can have embedded zeros when not assumed to be
    zero-terminated. So here:

         char s[]={1,2,3,0,4,5,6};

    s will have a length of 7.

    Strings *by definition* cannot have embedded zeros.  A null character
    terminates a string.

    A string literal can have embedded \0 characters, but if you're
    suggesting that #embed should expand to a string literal, I can see
    several disadvantages and no significant advantages.  For one thing, the
    data may or may not end with a null character; string literals always
    do.

    Not here:

        char s[]  = "ABC";

    "ABC" is a "string literal". Once things like concatenation of adjacent
    strings, macro expansion, etc., are complete, a null character is
    appended to it. Then it is used as an initialiser for the array "s".
    After initialisation, "s" is an array of 4 chars and contains a string.

    (Note - a "string literal" might not be a "string", because string
    literals can contain embedded nulls. This is a footnote in 6.4.5
    describing string literals.)

        char t[3] = "DEF";

    The "DEF" string doesn't end with a zero.

    "DEF" is not a "string" - it is a "string literal". It does get a null
    character appended during translation phase 7. But only the first three
    characters - therefore not including the null character - get copied to
    "t" during the initialisation of "t". "t" is an array of 3 chars, and
    it does not contain a string.
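
    The two initialisations can be checked directly. A minimal sketch,
    valid as C (C++ rejects the second declaration); contains_string and
    the other helper names are hypothetical:

```c
#include <stddef.h>
#include <string.h>

static const char s[]  = "ABC";  /* size from the literal: 'A','B','C','\0' */
static const char t[3] = "DEF";  /* exactly three chars; the null is not copied */

/* Returns 1 if the n-element char array at p contains a null character,
   i.e. if a string starts at p and ends within the array. */
int contains_string(const char *p, size_t n)
{
    return memchr(p, '\0', n) != NULL;
}

size_t size_of_s(void) { return sizeof s; }
size_t size_of_t(void) { return sizeof t; }

int s_holds_string(void) { return contains_string(s, sizeof s); }
int t_holds_string(void) { return contains_string(t, sizeof t); }
```

    size_of_s() is 4 and s_holds_string() is 1; size_of_t() is 3 and
    t_holds_string() is 0, so passing t to strlen or any other str*
    function would read past the end of the array.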


    Is 'string' given a special meaning in the standard?


    Yes. See 7.1.1p1.

    /That/ would seem to me to be too restrictive. Does this:

       char *s;

    define a pointer to such a string, or can it be any kind of data? For
    example, `char*` is used by the GetOpenFileName WinAPI function for a
    /series/ of zero-terminated strings which itself is terminated with two
    zero bytes.


    "char *s;" declares a pointer to a char, or a pointer to the start of an
    array of char. It is /not/ a string, or a pointer to a string. C
    strings are values, and exist at run-time - they are not types. So "s"
    can point to a string, or a char (which will be a string if and only if
    it is a null character), or an array of chars (which may or may not
    contain a string), or it could point to anything else.

    So it is some property that is attributed to the data that will be stored.

    Yes.


    I normally use `cstring` or `stringz` outside the language when referring
    to a zero-terminated sequence of characters, which implies that
    embedded zeros aren't allowed.


    That makes sense. Different languages have different ways of holding
    sequences of characters (generic "string" data), so you need to qualify
    the term if it is not clear from the context.

    But when we are discussing C, and there is no other qualification,
    "string" means "C string" - the definition of "string" given in the C standards.

  • From tTh@21:1/5 to bart on Thu Feb 29 16:34:24 2024
    On 2/29/24 11:18, bart wrote:
    Using 'strinclude' in my old C compiler, it took about 1 second to build
    this program:

      #include <stdio.h>
      #include <string.h>

      char* s=strinclude("data");

      int main(void) {
         printf("%zu\n", strlen(s));
     }

    tth@redlady:~/Desktop$ man strinclude
    No manual entry for strinclude
    tth@redlady:~/Desktop$

    --
    +---------------------------------------------------------------------+
    | https://tube.interhacker.space/a/tth/video-channels                 |
    +---------------------------------------------------------------------+

  • From Scott Lurndal@21:1/5 to bart on Thu Feb 29 15:48:51 2024
    bart <bc@freeuk.com> writes:
    On 28/02/2024 23:52, Lawrence D'Oliveiro wrote:
    On Wed, 28 Feb 2024 21:34:14 +0000, bart wrote:

    In C:

    void Add(int CategoryCode, ItemType Item) {
    CodeToIndex_put(CategoryCode, getCount());
    add(Item);
    }

    4 non-comment lines versus 9. I know Java needs tons of boilerplate,
    but it is not all the language's fault.

    Or how about

    void Add(int CategoryCode, ItemType Item) {CodeToIndex_put(CategoryCode, getCount());add(Item);}

    Wow! I never realized you could do that in C!! I thought it was an
    error to put stuff after column 72 or something. Thanks for the tip!!!

    Well, you could write an entire program on one line.

    int main(int b,char**i){long long n=B,a=I^n,r=(a/b&a)>>4,y=atoi(*++i),_=(((a^n/b)*(y>>T)|y>>S)&r)|(a^r);printf("%.8s\n",(char*)&_);}

    (A winner from the obfuscated C contest).

  • From Janis Papanagnou@21:1/5 to Scott Lurndal on Thu Feb 29 17:03:41 2024
    On 29.02.2024 16:48, Scott Lurndal wrote:

    int main(int b,char**i){long long n=B,a=I^n,r=(a/b&a)>>4,y=atoi(*++i),_=(((a^n/b)*(y>>T)|y>>S)&r)|(a^r);printf("%.8s\n",(char*)&_);}

    What does it do?

    What preconditions must be fulfilled or what additions
    does it need to compile?


    (A winner from the obfuscated C contest).

    (Are non-compiling C sources allowed in the contest?)

    Janis

  • From Scott Lurndal@21:1/5 to bart on Thu Feb 29 15:53:04 2024
    bart <bc@freeuk.com> writes:
    On 28/02/2024 23:31, Keith Thompson wrote:
    bart <bc@freeuk.com> writes:

    It would be unfortunate if your example was allowed. Clearly a binary
    representation of an instance of your struct would probably require 16
    bytes rather than 4, of which one may be padding.

    Depending on the sizes and alignments of the various types, sure.
    So what?


    If you have suggestions for alternate ways to define #embed, they might
    be interesting, but it's too late to change the existing specification.


    My early comments on this were about compiler performance. I suggested
    there might be a way to turn 100,000 byte values in a file, directly
    into a 100KB string or data block, without needing to first convert
    100,000 values into 100,000 integer expressions represented as tokens,
    and to then parse those 100,000 expressions into AST nodes etc.

    DB suggested something like that was actually done. But you can't do
    that if those 100,000 numbers represent from 100KB to 800KB of memory
    depending on the data type of the structure they're initialising.

    An implementation is free to simply pass a variant (or the directive
    itself) of #embed from the pre-processor to the compiler if the programmer
    isn't using -E, and the compiler could simply copy the embedded file
    into the object file directly, without processing it as a series of
    integer values. Much like the #file and #line directives passed by
    the pre-processor to the compiler.

  • From Scott Lurndal@21:1/5 to Janis Papanagnou on Thu Feb 29 16:17:44 2024
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
    On 29.02.2024 16:48, Scott Lurndal wrote:

    int main(int b,char**i){long long n=B,a=I^n,r=(a/b&a)>>4,y=atoi(*++i),_=(((a^n/b)*(y>>T)|y>>S)&r)|(a^r);printf("%.8s\n",(char*)&_);}

    What does it do?

    What preconditions must be fulfilled or what additions
    does it need to compile?


    (A winner from the obfuscated C contest).

    (Are non-compiling C sources allowed in the contest?)

    https://www.ioccc.org/years.html

    The above is from 'burton'.

  • From Scott Lurndal@21:1/5 to tTh on Thu Feb 29 16:15:47 2024
    tTh <tth@none.invalid> writes:
    On 2/29/24 01:47, bart wrote:

    My early comments on this were about compiler performance. I suggested
    there might be a way to turn 100,000 byte values in a file, directly
    into a 100KB string or data block, without needing to first convert
    100,000 values into 100,000 integer expressions represented as tokens,
    and to then parse those 100,000 expressions into AST nodes etc.

    But you HAVE to do that if #embed is in the preprocessor,
    because its job is to give compilable text to the real
    compiler. No other way is possible.

    The standard does not require the preprocessor to be
    separate from a 'real' compiler. It's acceptable for an implementation
    to implement both in a single executable. Absent -E, the
    preprocessor and compiler can cooperate to efficiently handle
    #embed without generating parseable C code.

  • From Janis Papanagnou@21:1/5 to Scott Lurndal on Thu Feb 29 18:12:20 2024
    On 29.02.2024 17:17, Scott Lurndal wrote:
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
    On 29.02.2024 16:48, Scott Lurndal wrote:

    int main(int b,char**i){long long n=B,a=I^n,r=(a/b&a)>>4,y=atoi(*++i),_=(((a^n/b)*(y>>T)|y>>S)&r)|(a^r);printf("%.8s\n",(char*)&_);}

    What does it do?

    What preconditions must be fulfilled or what additions
    does it need to compile?

    With the link below I see it "needs" a 600+ lines long Makefile.

    And I see there's some variable definitions with magic numbers
    passed.



    (A winner from the obfuscated C contest).

    (Are non-compiling C sources allowed in the contest?)

    https://www.ioccc.org/years.html

    The above is from 'burton'.

    Thanks.

    Janis

  • From Janis Papanagnou@21:1/5 to Keith Thompson on Thu Feb 29 18:17:19 2024
    On 29.02.2024 17:18, Keith Thompson wrote:

    "abc\0def" is a valid string literal, but its value is not a string.
    (No, the standard doesn't say that the value of a string literal is a string.)

    This sounds somewhat strange in my ears. Usually a literal for a type
    will constitute an instance of the type. - I suppose the irregularity
    stems from the fact that there's no explicit string object type in C.
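
    A minimal sketch of the distinction, assuming a C11 compiler. The names
    lit, literal_array_size, literal_string_length and literal_tail are
    made up for illustration and do not come from the thread. The literal
    initialises an 8-byte array, but the string that starts at its first
    byte ends at the embedded zero:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "abc\0def" produces the bytes 'a','b','c','\0','d','e','f' plus the
   terminating '\0' that every string literal gets: 8 bytes in total. */
static const char lit[] = "abc\0def";

static_assert(sizeof lit == 8, "array holds all 8 bytes of the literal");

/* Size of the whole array object, in bytes. */
size_t literal_array_size(void)
{
    return sizeof lit;
}

/* Length of the string starting at lit: strlen stops at the first '\0'. */
size_t literal_string_length(void)
{
    return strlen(lit);
}

/* The bytes after the embedded zero form a second string in the same array. */
const char *literal_tail(void)
{
    return lit + 4;
}
```

    Here literal_array_size() yields 8 while literal_string_length() yields
    3, so the array as a whole contains two strings rather than being one.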

    Janis

  • From Scott Lurndal@21:1/5 to Keith Thompson on Thu Feb 29 17:28:33 2024
    Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    bart <bc@freeuk.com> writes:
    On 28/02/2024 23:31, Keith Thompson wrote:
    bart <bc@freeuk.com> writes:

    It would be unfortunate if your example was allowed. Clearly a binary
    representation of an instance of your struct would probably require 16
    bytes rather than 4, of which one may be padding.

    Depending on the sizes and alignments of the various types, sure.
    So what?


    If you have suggestions for alternate ways to define #embed, they might
    be interesting, but it's too late to change the existing specification.

    My early comments on this were about compiler performance. I suggested
    there might be a way to turn 100,000 byte values in a file, directly
    into a 100KB string or data block, without needing to first convert
    100,000 values into 100,000 integer expressions represented as tokens,
    and to then parse those 100,000 expressions into AST nodes etc.

    DB suggested something like that was actually done. But you can't do
    that if those 100,000 numbers represent from 100KB to 800KB of memory
    depending on the data type of the structure they're initialising.

    An implementation is free to simply pass a variant (or the directive
    itself) of #embed from the pre-processor to the compiler if the programmer
    isn't using -E, and the compiler could simply copy the embedded file
    into the object file directly, without processing it as a series of
    integer values. Much like the #file and #line directives passed by
    the pre-processor to the compiler.

    Sure, an implementation has to operate *as if* it implemented the 8
    translation phases separately. But given a structure initialized with
    #embed, it would have to generate additional code to initialize the
    structure members from the bytes of the binary blob.

    Would it? Or could it simply assume that the binary blob
    is already in the same binary format that writing an instance
    of the structure from a C application on the same host would have created?

  • From Scott Lurndal@21:1/5 to Janis Papanagnou on Thu Feb 29 17:30:00 2024
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
    On 29.02.2024 17:17, Scott Lurndal wrote:
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
    On 29.02.2024 16:48, Scott Lurndal wrote:

    int main(int b,char**i){long long n=B,a=I^n,r=(a/b&a)>>4,y=atoi(*++i),_=(((a^n/b)*(y>>T)|y>>S)&r)|(a^r);printf("%.8s\n",(char*)&_);}

    What does it do?

    What preconditions must be fulfilled or what additions
    does it need to compile?

    With the link below I see it "needs" a 600+ lines long Makefile.

    The readme simply says compile it and run it
    as ./prog <value between 1 and 512>.

  • From David Brown@21:1/5 to Scott Lurndal on Thu Feb 29 18:58:49 2024
    On 29/02/2024 18:28, Scott Lurndal wrote:
    Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    bart <bc@freeuk.com> writes:
    On 28/02/2024 23:31, Keith Thompson wrote:
    bart <bc@freeuk.com> writes:

    It would be unfortunate if your example was allowed. Clearly a binary
    representation of an instance of your struct would probably require 16
    bytes rather than 4, of which one may be padding.

    Depending on the sizes and alignments of the various types, sure.
    So what?


    If you have suggestions for alternate ways to define #embed, they might
    be interesting, but it's too late to change the existing specification.

    My early comments on this were about compiler performance. I suggested
    there might be a way to turn 100,000 byte values in a file, directly
    into a 100KB string or data block, without needing to first convert
    100,000 values into 100,000 integer expressions represented as tokens,
    and to then parse those 100,000 expressions into AST nodes etc.

    DB suggested something like that was actually done. But you can't do
    that if those 100,000 numbers represent from 100KB to 800KB of memory
    depending on the data type of the structure they're initialising.

    An implementation is free to simply pass a variant (or the directive
    itself) of #embed from the pre-processor to the compiler if the programmer
    isn't using -E, and the compiler could simply copy the embedded file
    into the object file directly, without processing it as a series of
    integer values. Much like the #file and #line directives passed by
    the pre-processor to the compiler.

    Sure, an implementation has to operate *as if* it implemented the 8
    translation phases separately. But given a structure initialized with
    #embed, it would have to generate additional code to initialize the
    structure members from the bytes of the binary blob.

    Would it? Or could it simply assume that the binary blob
    is already in the same binary format that writing an instance
    of the structure from a C application on the same host would have created?

    That would depend on the sizes of the fields in the struct, and the size
    of the integer constants in the #embed. The norm for #embed will be
    unsigned char integer constants, so it will only be a direct fit for the
    binary representation of the struct if all the struct fields are
    compatible with that. But a compiler could have vendor parameters on
    the #embed to change those sizes.

  • From Scott Lurndal@21:1/5 to David Brown on Thu Feb 29 18:05:50 2024
    David Brown <david.brown@hesbynett.no> writes:
    On 29/02/2024 18:28, Scott Lurndal wrote:
    Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    bart <bc@freeuk.com> writes:
    On 28/02/2024 23:31, Keith Thompson wrote:
    bart <bc@freeuk.com> writes:

    It would be unfortunate if your example was allowed. Clearly a binary
    representation of an instance of your struct would probably require 16
    bytes rather than 4, of which one may be padding.

    Depending on the sizes and alignments of the various types, sure.
    So what?


    If you have suggestions for alternate ways to define #embed, they might
    be interesting, but it's too late to change the existing specification.

    My early comments on this were about compiler performance. I suggested
    there might be a way to turn 100,000 byte values in a file, directly
    into a 100KB string or data block, without needing to first convert
    100,000 values into 100,000 integer expressions represented as tokens,
    and to then parse those 100,000 expressions into AST nodes etc.

    DB suggested something like that was actually done. But you can't do
    that if those 100,000 numbers represent from 100KB to 800KB of memory
    depending on the data type of the structure they're initialising.

    An implementation is free to simply pass a variant (or the directive
    itself) of #embed from the pre-processor to the compiler if the programmer
    isn't using -E, and the compiler could simply copy the embedded file
    into the object file directly, without processing it as a series of
    integer values. Much like the #file and #line directives passed by
    the pre-processor to the compiler.

    Sure, an implementation has to operate *as if* it implemented the 8
    translation phases separately. But given a structure initialized with
    #embed, it would have to generate additional code to initialize the
    structure members from the bytes of the binary blob.

    Would it? Or could it simply assume that the binary blob
    is already in the same binary format that writing an instance
    of the structure from a C application on the same host would have created?

    That would depend on the sizes of the fields in the struct, and the size
    of the integer constants in the #embed.

    I'm embedding a binary file. I want the representation in memory
    to be _exactly_ the same as in the file, regardless of how it is
    defined in the C code (array of char, array of int, array of long, struct whatever).

  • From Scott Lurndal@21:1/5 to Scott Lurndal on Thu Feb 29 18:09:52 2024
    scott@slp53.sl.home (Scott Lurndal) writes:
    David Brown <david.brown@hesbynett.no> writes:
    On 29/02/2024 18:28, Scott Lurndal wrote:
    Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:
    scott@slp53.sl.home (Scott Lurndal) writes:

    An implementation is free to simply pass a variant (or the directive
    itself) of #embed from the pre-processor to the compiler if the programmer
    isn't using -E, and the compiler could simply copy the embedded file
    into the object file directly, without processing it as a series of
    integer values. Much like the #file and #line directives passed by
    the pre-processor to the compiler.

    Sure, an implementation has to operate *as if* it implemented the 8
    translation phases separately. But given a structure initialized with
    #embed, it would have to generate additional code to initialize the
    structure members from the bytes of the binary blob.

    Would it? Or could it simply assume that the binary blob
    is already in the same binary format that writing an instance
    of the structure from a C application on the same host would have created?

    That would depend on the sizes of the fields in the struct, and the size
    of the integer constants in the #embed.

    I'm embedding a binary file. I want the representation in memory
    to be _exactly_ the same as in the file, regardless of how it is
    defined in the C code (array of char, array of int, array of long, struct whatever).


    I have an actual use case today where #embed of a (C++) std::map binary
    object created by separate tool would be very useful. I'm
    planning on using mmap to load it at runtime at the moment.

  • From David Brown@21:1/5 to Keith Thompson on Thu Feb 29 19:08:22 2024
    On 29/02/2024 18:03, Keith Thompson wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 28/02/2024 22:57, Keith Thompson wrote:
    David Brown <david.brown@hesbynett.no> writes:
    [...]
    They won't use strings, they will use data blobs - binary data. Then
    there is no issue with null bytes. And yes, implementations will skip
    the token generation (unless you are doing something weird, such as
    using #embed to read the parameters to a function call).

    Tests with prototype implementations gave extremely fast results.
    I'm not sure how that would work. #embed is a preprocessor
    directive,
    and at least in the abstract model it has to expand to valid C code.
    I would have expected that it would simply generate the list of
    comma-separated integer constants described in the standard; later
    phases would simply parse that list and generate code as if that
    sequence had been written in the original source file. Do you know of
    an implementation that does something else?

    The key thing, as I understand it, is that the compiler gets to know
    that the integers in the list are all "nice". And since the
    preprocessor and the compiler are part of the same implementation
    (even if they are separate programs communicating with pipes or
    temporary files), the preprocessor could pass on the binary blob in a
    pre-parsed form.
    [...]

    Sure, an implementation *could* optimize #embed so it expands to some implementation-defined nonstandard form that later phases can treat as
    raw data. But since it's defined as a preprocessor directive, it's
    difficult to see how it could do so while covering all cases.


    It would require a strong link between the compiler and the preprocessor
    - as you know, these don't have to be separate programs. In a more
    weakly coupled system, there could still be a method for passing a
    binary blob to the compiler in addition to the integer data, and let the compiler use whichever form it preferred (based on what your code does
    with the data).

    [...]

    The results of testing are that #embed is /massively/ faster and lower
    memory compared to external generators, especially for larger files.
    And it gives you the data on-hand for optimisation purposes, unlike
    external direct linking of binary blobs. (So you can get the size of
    the array, or use values from it as compile-time known values.)
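
    A minimal sketch of the "size on hand" point, assuming a C11 compiler.
    Because #embed support cannot be assumed, the four bytes below are
    written out by hand as a stand-in for what the directive would expand
    to; blob, BLOB_SIZE and blob_sum are hypothetical names:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for an array initialised by #embed of a 4-byte file.
   The byte values are invented for illustration. */
static const unsigned char blob[] = { 0x89, 0x50, 0x4E, 0x47 };

/* The array size is part of the type, so it can be checked during
   translation and used as an integer constant expression. */
static_assert(sizeof blob == 4, "blob size is known at compile time");

enum { BLOB_SIZE = sizeof blob };

/* Sum of the bytes; with the data visible to the compiler, an optimiser
   is free to fold this to a constant. */
unsigned blob_sum(void)
{
    unsigned sum = 0;
    for (size_t i = 0; i < sizeof blob; ++i)
        sum += blob[i];
    return sum;
}
```

    With a blob linked in from a separate object file, the size would
    instead have to come from a linker symbol or a separate variable, and
    the contents would be opaque to the compiler.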

    What testing? The very latest versions of gcc and clang (I checked both their git repos yesterday) do not yet implement #embed.


    I believe prototypes, tests, or proofs of concept have been made for
    gcc, clang and perhaps other tools. I posted a link to some results -
    more are floating around the internet if you want to look for them.

    For example, say you have a file "foo.dat" containing 4 bytes with
    values 0, 1, 2, and 3. This would be perfectly valid:
    struct foo {
    unsigned char a;
    unsigned short b;
    unsigned int c;
    double d;
    };
    struct foo obj = {
    #embed "foo.dat"
    };
    #embed isn't defined to translate an input file to a sequence of
    bytes.
    It's defined to translate an input file to a sequence of integer
    constant expressions.
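
    To make that concrete, here is a sketch of what the initialiser amounts
    to once the directive has expanded, assuming a C11 compiler. The macro
    FOO_DAT_EXPANSION is a hand-written stand-in for the expansion of
    #embed "foo.dat" and is not anything the standard defines:

```c
#include <assert.h>

struct foo {
    unsigned char a;
    unsigned short b;
    unsigned int c;
    double d;
};

/* Hand-written stand-in for the expansion of #embed "foo.dat" when the
   file holds the four bytes 0, 1, 2, 3: a comma-separated list of
   integer constant expressions. */
#define FOO_DAT_EXPANSION 0, 1, 2, 3

/* Each constant initialises one member in order and is converted to that
   member's type; the file's bytes are not copied into the object's
   representation. */
static const struct foo obj = { FOO_DAT_EXPANSION };

/* The object contains a double, so it is at least that large, which is
   more than the 4-byte file wherever double is wider than 4 bytes. */
static_assert(sizeof(struct foo) >= sizeof(double),
              "struct foo is at least as large as its double member");

/* Value of the last member, converted from the integer constant 3. */
double foo_d(void)
{
    return obj.d;
}
```

    So obj.a, obj.b and obj.c end up as 0, 1 and 2, and obj.d as 3.0.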

    Yes. But the prime speed (and memory usage) gains are for
    large files, and that means array initialisers. That does not
    conflict with using it for cases like yours.

    So a compiler that does this would have to be able to handle

    struct foo obj = {
    #blob
    <binary data>
    #endblob
    };

    and initialize a, b, c, and d to 0, 1, 2, and 3.0, respectively from successive bytes of the binary data. Either that, or the preprocessor
    would have to use information it doesn't have to determine how to expand #embed.


    I think I've covered how that could be handled. (And I don't know how
    it /will/ be handled. But I am sure compiler implementers will figure a
    way to make it work correctly for any use of the integer constant list,
    while also making it as efficient as they reasonably can for the common
    case of initialising an unsigned char array.)

    *Maybe* a compiler could optimize for the case where it knows that it's
    being used to initialize an array of unsigned char, but (a) that would
    require the preprocessor to have information that normally doesn't exist
    until later phases, and (b) I'm not convinced it would be worth the
    effort.

    Look at
    <https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p1040r6.html#design-practice-speed>.

    In those tests, for a 40 MB file gcc #embed is 200 times faster than
    "xxd -i" generated files, and takes about 2.5% of the memory. It
    scales to 1 GB files. And that's just a proof-of-concept
    implementation.

    That's for std::embed, a proposed C++ feature that's *not* defined as a preprocessor directive. Sample usage from the paper:

    constexpr std::span<const std::byte> fxaa_binary =
    std::embed( "fxaa.spirv" );

    So the compiler knows the type of the object being initialized.

    (Note that the author of that C++ paper is also the editor for the C standard.)

    The work on #embed is being done simultaneously for C and C++.
    std::embed() gives you a slightly different way to write it, but the
    implementation is the same. (Not unlike _Pragma and #pragma in C.)

    Other pages I have seen with speed tests show the same pattern while
    referring explicitly to #embed.


    I'm still skeptical that C's #embed will actually be implemented other
    than as expanding to a sequence of integer constants.


    We'll see when it all hits the mainline compilers!

    On the other hand, C23 allows for additional implementation-defined
    parameters to #embed (as well as the standard embed parameters limit,
    prefix, suffix, and if_empty). Such a parameter could specify how it's
    expanded, perhaps to some implementation-defined blob format. *If*
    compilers optimize #embed to something other than a sequence of integer
    constant expressions, that's probably how it would be done. But since
    neither gcc nor clang implements #embed at all, it may be too early to
    speculate.


  • From James Kuyper@21:1/5 to Kaz Kylheku on Thu Feb 29 14:45:44 2024
    On 2/29/24 14:26, Kaz Kylheku wrote:
    On 2024-02-29, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    Exactly, "string" is not a type.

    It is a type in the broader sense, in that it is a logical proposition
    about the attributes of an object that is true or false.

    If I defined something to be a sequence of floating point numbers
    terminated by a NaN, would that thing also qualify as a type, according
    to the definition you're using?

    Could you give a source for the definition of "type" that you're using?
    Can you use the word "type" in a statement whose truth relies upon the difference between that definition and the way that "type" is defined by
    the C standard? Preferably it would be a useful statement that applies to C.

  • From David Brown@21:1/5 to Scott Lurndal on Thu Feb 29 20:51:11 2024
    On 29/02/2024 19:05, Scott Lurndal wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 29/02/2024 18:28, Scott Lurndal wrote:
    Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    bart <bc@freeuk.com> writes:
    On 28/02/2024 23:31, Keith Thompson wrote:
    bart <bc@freeuk.com> writes:

    It would be unfortunate if your example was allowed. Clearly a binary
    representation of an instance of your struct would probably require 16
    bytes rather than 4, of which one may be padding.

    Depending on the sizes and alignments of the various types, sure.
    So what?


    If you have suggestions for alternate ways to define #embed, they might
    be interesting, but it's too late to change the existing specification.

    My early comments on this were about compiler performance. I suggested
    there might be a way to turn 100,000 byte values in a file, directly
    into a 100KB string or data block, without needing to first convert
    100,000 values into 100,000 integer expressions represented as tokens,
    and to then parse those 100,000 expressions into AST nodes etc.

    DB suggested something like that was actually done. But you can't do
    that if those 100,000 numbers represent from 100KB to 800KB of memory
    depending on the data type of the structure they're initialising.

    An implementation is free to simply pass a variant (or the directive
    itself) of #embed from the pre-processor to the compiler if the programmer
    isn't using -E, and the compiler could simply copy the embedded file
    into the object file directly, without processing it as a series of
    integer values. Much like the #file and #line directives passed by
    the pre-processor to the compiler.

    Sure, an implementation has to operate *as if* it implemented the 8
    translation phases separately. But given a structure initialized with
    #embed, it would have to generate additional code to initialize the
    structure members from the bytes of the binary blob.

    Would it? Or could it simply assume that the binary blob
    is already in the same binary format that writing an instance
    of the structure from a C application on the same host would have created?

    That would depend on the sizes of the fields in the struct, and the size
    of the integer constants in the #embed.

    I'm embedding a binary file. I want the representation in memory
    to be _exactly_ the same as in the file, regardless of how it is
    defined in the C code (array of char, array of int, array of long, struct whatever).


    Then you would want a union of the struct type and an appropriately
    sized unsigned char array, and initialise the unsigned char array with
    the bytes of the file using #embed.
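
    A minimal sketch of that union approach, assuming a C11 compiler and a
    mainstream ABI where the struct below has no padding. The types header
    and header_image, the file contents, and the helper names are all
    invented for illustration; the hand-written byte list stands in for
    what #embed would supply:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical on-disk record: a 4-byte tag followed by a 32-bit length. */
struct header {
    unsigned char magic[4];
    uint32_t length;
};

/* Layout assumptions that must hold for the file bytes to line up with
   the struct members; the build fails if they do not. */
static_assert(sizeof(struct header) == 8, "no padding expected");
static_assert(offsetof(struct header, length) == 4, "length follows magic");

/* The byte array comes first so that it is the member being initialised. */
union header_image {
    unsigned char bytes[sizeof(struct header)];
    struct header hdr;
};

/* The list below stands in for the expansion of #embed on the file. */
static const union header_image image = {
    .bytes = { 'D', 'A', 'T', 'A', 0x10, 0x00, 0x00, 0x00 }
};

/* Read the length through the struct view. The value depends on the
   host's byte order, exactly as it would after fread() into the struct. */
uint32_t image_length(void)
{
    return image.hdr.length;
}

/* Check the tag through the struct view. */
int image_has_magic(void)
{
    return memcmp(image.hdr.magic, "DATA", 4) == 0;
}
```

    Reading one union member after initialising another is permitted in C
    (the bytes are reinterpreted), but not in C++, so this sketch is
    C-only.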

  • From Kaz Kylheku@21:1/5 to Keith Thompson on Thu Feb 29 19:26:42 2024
    On 2024-02-29, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
    On 29.02.2024 17:18, Keith Thompson wrote:
    "abc\0def" is a valid string literal, but its value is not a string.
    (No, the standard doesn't say that the value of a string literal is a
    string.)

    This sounds somewhat strange in my ears. Usually a literal for a type
    will constitute an instance of the type. - I suppose the irregularity
    stems from the fact that there's no explicit string object type in C.

    Exactly, "string" is not a type.

    It is a type in the broader sense, in that it is a logical proposition
    about the attributes of an object that is true or false.

    It's just not a type in the C static type system.

    What that means is that there does not exist a constraint rule in
    standard C requiring some expression or object to conform to the string
    type. The concept "string" is not represented in the constraint system.
    But it is a type concept.

    (There are rules that require a string, but they are not constraint
    rules. E.g. if strlen is given an argument which isn't a string, the
    behavior is undefined.)

    Consider:

    char a[3] = "abc";
    size_t l = strlen(a);

    In the unlikely event that this example would capture the attention of a computer scientist who researches type systems, he or she would identify
    that as having a type error. (One that the C type system is too weak to
    model.)
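
    Since the type system cannot express "this array holds a string", the
    property can only be checked at run time. A minimal sketch, with
    holds_string as a hypothetical helper name:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true if the n-byte array at buf contains a '\0' somewhere,
   i.e. if a string starts at buf and ends inside the array. memchr takes
   an explicit length, so it never reads past the array the way strlen
   would on an unterminated one. */
bool holds_string(const char *buf, size_t n)
{
    assert(buf != NULL);
    return memchr(buf, '\0', n) != NULL;
}
```

    For char a[3] = "abc"; the call holds_string(a, sizeof a) returns
    false, because the initialiser leaves no room for the terminator,
    while the same check on char b[4] = "abc"; returns true.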

    "Upper case letter" is also a type; that's why the header is called
    <ctype.h>.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

  • From Lawrence D'Oliveiro@21:1/5 to David Brown on Thu Feb 29 21:05:02 2024
    On Thu, 29 Feb 2024 08:58:40 +0100, David Brown wrote:

    On 28/02/2024 21:56, Lawrence D'Oliveiro wrote:

    On Wed, 28 Feb 2024 12:50:10 +0100, David Brown wrote:

    ... people write utilities for them in a variety of languages ...

    But it will often be more convenient to have it built into the
    language and compiler.

    What can be built into the language can only ever be a small subset of
    the many and varied ways that people have incorporated data blobs into
    their programs.

    Of course. But that doesn't mean that a language should not include a feature that makes it easy for a lot of people to get some data blobs
    into their code.

    Maybe the C compiler should concentrate on compiling C code, and leave it
    to the rest of the build toolchain to deal with other data.

  • From Lawrence D'Oliveiro@21:1/5 to Scott Lurndal on Thu Feb 29 21:27:52 2024
    On Thu, 29 Feb 2024 18:09:52 GMT, Scott Lurndal wrote:

    I have an actual use case today where #embed of a (C++) std::map binary object created by separate tool would be very useful. I'm planning on
    using mmap to load it at runtime at the moment.

    Why not convert it to a .o file and statically link it into your program
    as part of the build process?

  • From bart@21:1/5 to Keith Thompson on Thu Feb 29 21:44:20 2024
    On 29/02/2024 21:20, Keith Thompson wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
    On 29.02.2024 17:17, Scott Lurndal wrote:
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
    On 29.02.2024 16:48, Scott Lurndal wrote:

    int main(int b,char**i){long long n=B,a=I^n,r=(a/b&a)>>4,y=atoi(*++i),_=(((a^n/b)*(y>>T)|y>>S)&r)|(a^r);printf("%.8s\n",(char*)&_);}

    What does it do?

    What preconditions must be fulfilled or what additions
    does it need to compile?

    With the link below I see it "needs" a 600+ lines long Makefile.

    The readme simply says compile it and run it
    as ./prog <value between 1 and 512>.

    No, you have to compile it with specific command-line arguments to
    define B and I. The Makefile does that (don't ask me why it's so long),
    but you can do it manually.

    From hint.txt:
    """
    On a little-endian machine:

    clang -include stdio.h -include stdlib.h -Wall -Weverything -pedantic -DB=6945503773712347754LL -DI=5859838231191962459LL -DT=0 -DS=7 -o prog prog.c

    On a big-endian machine:

    clang -include stdio.h -include stdlib.h -Wall -Weverything -pedantic -DB=7091606191627001958LL -DI=6006468689561538903LL -DT=1 -DS=0 -o prog.be prog.c
    """



    Isn't it cheating when half the program is part of the build
    instructions? Here is a complete standalone program:

    ----------------
    #include <stdio.h>
    #include <stdlib.h>
    #define B 6945503773712347754LL
    #define I 5859838231191962459LL
    #define T 0
    #define S 7
    int main(int b,char**i){long long n=B,a=I^n,r=(a/b&a)>>4,y=atoi(*++i),_=(((a^n/b)*(y>>T)|y>>S)&r)|(a^r);printf("%.8s\n",(char*)&_);}
    ----------------

    It's 261 bytes. The 'one-liner' that was posted was 134 bytes.

    (It appears to print an input of 0 to 255 as binary.)
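
    For comparison, a plain version of the behaviour described above,
    taking the "prints 0 to 255 as binary" observation at face value. This
    is not a reconstruction of the contest entry's logic; to_binary8 is a
    made-up name and the most-significant-bit-first order is an assumption:

```c
#include <assert.h>

/* Write the 8 binary digits of v (0..255) into out, most significant bit
   first, followed by a terminating '\0'. out must have room for 9 chars. */
void to_binary8(unsigned v, char out[9])
{
    assert(v <= 255u);
    for (int bit = 7; bit >= 0; --bit)
        *out++ = ((v >> bit) & 1u) ? '1' : '0';
    *out = '\0';
}
```

    Calling to_binary8(5, buf) leaves "00000101" in buf.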

  • From Lawrence D'Oliveiro@21:1/5 to David Brown on Thu Feb 29 21:36:05 2024
    On Thu, 29 Feb 2024 16:19:45 +0100, David Brown wrote:

    An array of bytes is not a "string".

    It is in PHP, I think also in Perl, and also in (obsolete) Python 2.

    And what about C string functions that take explicit lengths?

  • From David Brown@21:1/5 to Lawrence D'Oliveiro on Fri Mar 1 09:16:00 2024
    On 29/02/2024 22:05, Lawrence D'Oliveiro wrote:
    On Thu, 29 Feb 2024 08:58:40 +0100, David Brown wrote:

    On 28/02/2024 21:56, Lawrence D'Oliveiro wrote:

    On Wed, 28 Feb 2024 12:50:10 +0100, David Brown wrote:

    ... people write utilities for them in a variety of languages ...

    But it will often be more convenient to have it built into the
    language and compiler.

    What can be built into the language can only ever be a small subset of
    the many and varied ways that people have incorporated data blobs into
    their programs.

    Of course. But that doesn't mean that a language should not include a
    feature that makes it easy for a lot of people to get some data blobs
    into their code.

    Maybe the C compiler should concentrate on compiling C code, and leave it
    to the rest of the build toolchain to deal with other data.

    It is possible to be actively involved in the development of the
    standards - preparing and discussing proposals, joining committees, or
    at least joining mailing lists for the discussions. If you are not
    doing the work and showing the interest /before/ decisions are made, you
    don't get a say afterwards. It is more productive to discuss what you
    can do with the features C has, than to wish it never had them.

    Oh, and the reason C23 has #embed, is because people want it. It is
    something C developers have asked for for many years. /You/ might not
    have use of it, but that's true of lots of features of C for all
    programmers - no one needs everything in the language and standard library.

  • From bart@21:1/5 to Lawrence D'Oliveiro on Fri Mar 1 11:52:16 2024
    On 29/02/2024 21:27, Lawrence D'Oliveiro wrote:
    On Thu, 29 Feb 2024 18:09:52 GMT, Scott Lurndal wrote:

    I have an actual use case today where #embed of a (C++) std::map binary
    object created by separate tool would be very useful. I'm planning on
    using mmap to load it at runtime at the moment.

    Why not convert it to a .o file and statically link it into your program
    as part of the build process?

    That's exactly what #embed will enable.

  • From bart@21:1/5 to tTh on Fri Mar 1 11:58:49 2024
    On 29/02/2024 15:34, tTh wrote:
    On 2/29/24 11:18, bart wrote:
    Using 'strinclude' in my old C compiler, it took about 1 second to
    build this program:

       #include <stdio.h>
       #include <string.h>

       char* s=strinclude("data");

       int main(void) {
          printf("%zu\n", strlen(s));
       }

    tth@redlady:~/Desktop$ man strinclude
    No manual entry for strinclude
    tth@redlady:~/Desktop$


    'strinclude' is an extension I made for that compiler.

    #embed is the new feature of C23. Although I'm not sure how it would be
    used to initialise a char* pointer. Perhaps like this:

    char dummy[] = {
    #embed "data"
    ,0};
    char* s = dummy;

    (I've added a 0-terminator here; I don't know if #embed will take care
    of that.)

    My 'strinclude' produces a zero-terminated string, but it is done within
    the parser rather than lexer.

  • From Richard Harnden@21:1/5 to Lawrence D'Oliveiro on Fri Mar 1 12:59:43 2024
    On 29/02/2024 21:36, Lawrence D'Oliveiro wrote:
    On Thu, 29 Feb 2024 16:19:45 +0100, David Brown wrote:

    An array of bytes is not a "string".

    It is in PHP, I think also in Perl, and also in (obsolete) Python 2.

    And what about C string functions that take explicit lengths?

    You mean: There's a danger that a function that returns a 'string', but
    truncates it to n chars, might not be returning a string at all?

  • From David Brown@21:1/5 to bart on Fri Mar 1 13:17:01 2024
    On 01/03/2024 12:58, bart wrote:
    On 29/02/2024 15:34, tTh wrote:
    On 2/29/24 11:18, bart wrote:
    Using 'strinclude' in my old C compiler, it took about 1 second to
    build this program:

       #include <stdio.h>
       #include <string.h>

       char* s=strinclude("data");

       int main(void) {
          printf("%zu\n", strlen(s));
       }

    tth@redlady:~/Desktop$ man strinclude
    No manual entry for strinclude
    tth@redlady:~/Desktop$


    'strinclude' is an extension I made for that compiler.

    #embed is the new feature of C23. Although I'm not sure how it would be
    used to initialise a char* pointer. Perhaps like this:

        char dummy[] = {
        #embed "data"
        ,0};
        char* s = dummy;

    (I've added a 0-terminator here; I don't know if #embed will take care
    of that.)

    #embed very specifically does not add anything. So you would do :

    const char s[] = {
    #embed "data" suffix(,)
    0
    };

    The "suffix" parameter adds a comma if "data" is not empty, and does
    nothing if "data" is empty. Writing it as you did would work fine for
    non-empty "data" but give the nonsensical result {,0} if "data" is
    empty. (You might not care about such cases and prefer to write the
    simpler version, but now you also know about "suffix".)

    There is no need to have a separate character pointer variable - the
    const char array can be used directly in most circumstances.
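
A minimal sketch of that pattern follows. So that it stands alone, it embeds its own source file via `__FILE__` rather than a separate "data" file, and it falls back to a hand-written initializer on compilers that lack C23 `#embed`. Both choices, and the names `HAVE_EMBED`, `blob` and `blob_payload_size`, are assumptions made for illustration.

```c
#include <stddef.h>

/* Use #embed only if the compiler has it and can find the resource. */
#if defined(__has_embed)
#  if __has_embed(__FILE__)
#    define HAVE_EMBED 1
#  endif
#endif

static const char blob[] = {
#ifdef HAVE_EMBED
    /* The file's bytes, followed by a comma only if the file is non-empty. */
#embed __FILE__ suffix(,)
#else
    /* Stand-in data for compilers without #embed. */
    'f', 'a', 'l', 'l', 'b', 'a', 'c', 'k',
#endif
    /* Explicit terminator: #embed itself adds nothing. */
    0
};

/* Number of bytes that precede the explicit terminator. */
static size_t blob_payload_size(void)
{
    return sizeof blob - 1;
}
```

With a real resource, the `__FILE__` operand would be replaced by the quoted file name, as in `#embed "data" suffix(,)` above.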


    My 'strinclude' produces a zero-terminated string, but it is done within
    the parser rather than lexer.

  • From Kaz Kylheku@21:1/5 to David Brown on Fri Mar 1 16:55:51 2024
    On 2024-03-01, David Brown <david.brown@hesbynett.no> wrote:
    It is possible to be actively involved in the development of the
    standards - preparing and discussing proposals, joining committees, or
    at least joining mailing lists for the discussions. If you are not
    doing the work and showing the interest /before/ decisions are made, you
    don't get a say afterwards. It is more productive to discuss what you
    can do with the features C has, than to wish it never had them.

    Also, if you don't join the gang that breaks windows and spray-paints
    walls, you don't get to say afterward which windows are broken
    and what is scribbled on what wall.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

  • From Janis Papanagnou@21:1/5 to Keith Thompson on Fri Mar 1 18:09:05 2024
    On 29.02.2024 23:06, Keith Thompson wrote:
    bart <bc@freeuk.com> writes:
    [...]
    Isn't it cheating when half the program is part of the build
    instructions?

    I recall from decades ago (when I looked into this contest) that they
    even had a contribution that fed the whole C program into the compiler
    through compiler options. (I think it even got a prize.)

    "Is it cheating?" - I'd say no, since it was accepted.

    Is it really about an "obfuscated C code"? - I'd say no. (But it was
    anyway a curiosity.)


    Apparently not. If it were, the judges of the IOCCC would not have
    accepted it.
    [...]

    One of the winners of the 1988 contest was:
    #include "/dev/tty"

    This is great! :-)

    Janis

  • From David Brown@21:1/5 to Kaz Kylheku on Fri Mar 1 18:28:05 2024
    On 01/03/2024 17:55, Kaz Kylheku wrote:
    On 2024-03-01, David Brown <david.brown@hesbynett.no> wrote:
    It is possible to be actively involved in the development of the
    standards - preparing and discussing proposals, joining committees, or
    at least joining mailing lists for the discussions. If you are not
    doing the work and showing the interest /before/ decisions are made, you
    don't get a say afterwards. It is more productive to discuss what you
    can do with the features C has, than to wish it never had them.

    Also, if you don't join the gang that breaks windows and spray-paints
    walls, you don't get to say afterward which windows are broken
    and what is scribbled on what wall.


    A slightly closer version of that feeble analogy would be that you don't
    get to say they should have used a different colour, or broken doors
    instead of windows.

    It's okay for Lawrence (or anyone else) to say that they don't approve of
    #embed, or don't think they will use it themselves. But like most
    (probably all) features in newer C standards, it was added because
    enough people wanted it for the committee and connected developers to do
    the work designing and documenting the features, and testing prototypes
    in practice.

    There are procedures in place for people to have an influence on the
    future of C. If you want to have your say, you can have it. But
    waiting until a new standard version is solidified and then complaining
    that you don't like the direction it is taking, is too late. Whining
    about things here afterwards doesn't do anyone any good.

    That's different from saying you don't like the feature, or you don't
    like the way C is heading, or you won't use it yourself. And it's
    different from talking about it, trying to learn how a new feature works
    and how to make the best of it. Such discussions are great, and I'd
    love to see more of them here in c.l.c.

  • From Lawrence D'Oliveiro@21:1/5 to Richard Harnden on Fri Mar 1 20:59:11 2024
    On Fri, 1 Mar 2024 12:59:43 +0000, Richard Harnden wrote:

    You mean: There's a danger that a function that returns a 'string', but
    truncates it to n chars, might not be returning a string at all?

    If it’s not NUL-terminated, then it’s not a “string”, right?
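
A small sketch of the truncation hazard being discussed, assuming the length-limited function in question behaves like `strncpy`, which writes no terminator when the source does not fit. The helper names `has_terminator` and `copy_is_string`, and the 4-byte buffer, are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: report whether the first n bytes of buf contain
 * a null character, i.e. whether buf holds a string in the C sense. */
static int has_terminator(const char *buf, size_t n)
{
    return memchr(buf, '\0', n) != NULL;
}

/* Copy src into a 4-byte buffer with strncpy and report whether the
 * result is terminated. strncpy stores no null when src has 4 or more
 * characters before its own terminator. */
static int copy_is_string(const char *src)
{
    char buf[4];
    strncpy(buf, src, sizeof buf);
    return has_terminator(buf, sizeof buf);
}
```

Here `copy_is_string("abc")` reports a terminated result, while `copy_is_string("abcdef")` does not: the buffer then holds four characters and no null.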

  • From Janis Papanagnou@21:1/5 to Keith Thompson on Fri Mar 1 22:06:03 2024
    On 01.03.2024 19:49, Keith Thompson wrote:

    Like most Abuse of the Rules winners, it resulted in a rule change for
    the following years.

    Makes sense.

    Janis

  • From Lawrence D'Oliveiro@21:1/5 to bart on Tue Mar 5 04:47:18 2024
    On Fri, 1 Mar 2024 11:52:16 +0000, bart wrote:

    On 29/02/2024 21:27, Lawrence D'Oliveiro wrote:

    On Thu, 29 Feb 2024 18:09:52 GMT, Scott Lurndal wrote:

    I have an actual use case today where #embed of a (C++) std::map
    binary object created by separate tool would be very useful. I'm
    planning on using mmap to load it at runtime at the moment.

    Why not convert it to a .o file and statically link it into your
    program as part of the build process?

    That's exactly what #embed will enable.

    You can call it a toy version of objcopy <https://manpages.debian.org/1/objcopy.1.html>.

  • From Lawrence D'Oliveiro@21:1/5 to Keith Thompson on Tue Mar 5 04:48:38 2024
    On Thu, 29 Feb 2024 14:14:52 -0800, Keith Thompson wrote:

    "A *string* is a contiguous sequence of characters
    terminated by and including the first null character."

    So how come strlen(3) does not include the null?

  • From Scott Lurndal@21:1/5 to Lawrence D'Oliveiro on Tue Mar 5 15:09:06 2024
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    On Fri, 1 Mar 2024 11:52:16 +0000, bart wrote:

    On 29/02/2024 21:27, Lawrence D'Oliveiro wrote:

    On Thu, 29 Feb 2024 18:09:52 GMT, Scott Lurndal wrote:

    I have an actual use case today where #embed of a (C++) std::map
    binary object created by separate tool would be very useful. I'm
    planning on using mmap to load it at runtime at the moment.

    Why not convert it to a .o file and statically link it into your
    program as part of the build process?

    That's exactly what #embed will enable.

    You can call it a toy version of objcopy <https://manpages.debian.org/1/objcopy.1.html>.

    While objcopy supports a number of ways to
    manipulate an ELF file, I wouldn't equate it
    with #embed at all.

  • From Lawrence D'Oliveiro@21:1/5 to Scott Lurndal on Wed Mar 6 01:49:36 2024
    On Tue, 05 Mar 2024 15:09:06 GMT, Scott Lurndal wrote:

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:

    On Fri, 1 Mar 2024 11:52:16 +0000, bart wrote:

    On 29/02/2024 21:27, Lawrence D'Oliveiro wrote:

    On Thu, 29 Feb 2024 18:09:52 GMT, Scott Lurndal wrote:

    I have an actual use case today where #embed of a (C++) std::map
    binary object created by separate tool would be very useful. I'm
    planning on using mmap to load it at runtime at the moment.

    Why not convert it to a .o file and statically link it into your
    program as part of the build process?

    That's exactly what #embed will enable.

    You can call it a toy version of objcopy <https://manpages.debian.org/1/objcopy.1.html>.

    While objcopy supports a number of ways to manipulate an ELF file, I
    wouldn't equate it with #embed at all.

    It does a whole lot more.

  • From Lawrence D'Oliveiro@21:1/5 to Keith Thompson on Thu Mar 7 21:08:48 2024
    On Mon, 04 Mar 2024 20:55:28 -0800, Keith Thompson wrote:

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:

    On Thu, 29 Feb 2024 14:14:52 -0800, Keith Thompson wrote:

    "A *string* is a contiguous sequence of characters terminated by and
    including the first null character."

    So how come strlen(3) does not include the null?

    Because the *length of a string* is by definition "the number of bytes
    preceding the null character".

    So the “string” itself includes the null character, but its “length” does
    not?

  • From Kaz Kylheku@21:1/5 to Lawrence D'Oliveiro on Thu Mar 7 21:44:06 2024
    On 2024-03-07, Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
    On Mon, 04 Mar 2024 20:55:28 -0800, Keith Thompson wrote:

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:

    On Thu, 29 Feb 2024 14:14:52 -0800, Keith Thompson wrote:

    "A *string* is a contiguous sequence of characters terminated by and
    including the first null character."

    So how come strlen(3) does not include the null?

    Because the *length of a string* is by definition "the number of bytes
    preceding the null character".

    So the “string” itself includes the null character, but its “length” does
    not?

    That's correct. However, its size includes it.

    sizeof "abc" == 4

    strlen("abc") == 3

    The abstract string does not include the null character;
    we understand "abc" to be a three character string.

    The C representation of the string includes the null character;
    the size is a representational concept so it counts it.

    It is common for C programs to break encapsulation and openly deal with
    that terminating null.
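
The size/length distinction can be checked directly. In this sketch the function names `length_of_abc` and `representation_size` are hypothetical, and `_Static_assert` assumes a C11 or later compiler.

```c
#include <stddef.h>
#include <string.h>

/* The literal "abc" has array type char[4]: three characters plus the
 * terminating null, so sizeof counts four bytes. */
_Static_assert(sizeof "abc" == 4, "size includes the terminating null");

/* strlen counts only the bytes that precede the null character. */
static size_t length_of_abc(void)
{
    return strlen("abc");
}

/* Bytes occupied by the representation of any string: length plus one. */
static size_t representation_size(const char *s)
{
    return strlen(s) + 1;
}
```

Even the empty string occupies one byte under this view: `representation_size("")` is 1 while its length is 0.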

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

  • From Kaz Kylheku@21:1/5 to Keith Thompson on Thu Mar 7 23:00:20 2024
    On 2024-03-07, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    Kaz Kylheku <433-929-6894@kylheku.com> writes:
    On 2024-03-07, Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
    On Mon, 04 Mar 2024 20:55:28 -0800, Keith Thompson wrote:
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    On Thu, 29 Feb 2024 14:14:52 -0800, Keith Thompson wrote:
    "A *string* is a contiguous sequence of characters terminated by and
    including the first null character."

    So how come strlen(3) does not include the null?

    Because the *length of a string* is by definition "the number of bytes
    preceding the null character".

    So the “string” itself includes the null character, but its “length” does
    not?

    That's correct. However, its size includes it.

    sizeof "abc" == 4

    strlen("abc") == 3

    The abstract string does not include the null character;
    we understand "abc" to be a three character string.

    Sure, if you define "abstract string" that way. I'll just note that C's
    definition of the word "string" does include the terminating null
    character, and does not talk about "abstract strings". (A string in the
    abstract machine clearly includes the null character, but that's a bit
    of a stretch.)

    Yes; "abstract machine" is not what I mean by abstract.

    The concept of the abstract string lives in the semantics though.

    When N strings are catenated together, their abstract strings are
    juxtaposed together without any nulls in between, with only a single
    null at the end.

    Furthermore, when a string is sent to a stream with %s or {f}puts,
    the null byte is omitted, like in the calculation of length.

    Clearly, there is a semantics that the part before the null byte
    is the text processing payload; what I'm calling the abstract string.
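
Both observations can be sketched in a few lines. The helpers `join` and `emitted_by_percent_s` are hypothetical names; `join` is a hand-rolled stand-in for catenation rather than any standard function.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Join a and b into out (capacity cap bytes). Returns 1 on success and
 * 0 if the result would not fit. The two payloads are juxtaposed and
 * only one null, at the very end, is stored. */
static int join(char *out, size_t cap, const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    if (la + lb + 1 > cap)
        return 0;
    memcpy(out, a, la);          /* payload of a, without its null */
    memcpy(out + la, b, lb + 1); /* payload of b plus the single null */
    return 1;
}

/* Number of bytes "%s" would emit for s: the payload only, never the
 * null. snprintf with a null buffer and size 0 just reports the count. */
static int emitted_by_percent_s(const char *s)
{
    return snprintf(NULL, 0, "%s", s);
}
```

Joining "abc" and "def" needs a 7-byte buffer: six payload bytes plus one terminator, not the eight bytes the two representations occupy separately.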

    (With character encodings, it gets hairy. The part before the null
    may be a UTF-8 sequence, where the abstract string consists of code
    points. Which may be combining characters, so the True Scotsman's
    abstract string is the sequence of characters.)
    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

  • From Richard Harnden@21:1/5 to Keith Thompson on Fri Mar 8 00:26:04 2024
    On 07/03/2024 22:25, Keith Thompson wrote:
    Kaz Kylheku <433-929-6894@kylheku.com> writes:
    On 2024-03-07, Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
    On Mon, 04 Mar 2024 20:55:28 -0800, Keith Thompson wrote:
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    On Thu, 29 Feb 2024 14:14:52 -0800, Keith Thompson wrote:
    "A *string* is a contiguous sequence of characters terminated by and
    including the first null character."

    So how come strlen(3) does not include the null?

    Because the *length of a string* is by definition "the number of bytes
    preceding the null character".

    So the “string” itself includes the null character, but its “length” does
    not?

    That's correct. However, its size includes it.

    sizeof "abc" == 4

    strlen("abc") == 3

    The abstract string does not include the null character;
    we understand "abc" to be a three character string.

    Sure, if you define "abstract string" that way. I'll just note that C's
    definition of the word "string" does include the terminating null
    character, and does not talk about "abstract strings". (A string in the
    abstract machine clearly includes the null character, but that's a bit
    of a stretch.)

    A string is just a data format.

    You have a string of chars, terminated by a '\0'.

    You can have a "string" of anything, terminated by a NULL.
    Everyone's used to argv, for example.
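
As a sketch of that analogy, a NULL-terminated array of pointers can be measured the same way strlen measures chars. The name `ptr_vector_length` is hypothetical.

```c
#include <stddef.h>

/* Count the entries of a NULL-terminated array of pointers, walking
 * until the sentinel just as strlen walks until '\0'. */
static size_t ptr_vector_length(char *const *vec)
{
    size_t n = 0;
    while (vec[n] != NULL)
        n++;
    return n;
}
```

In a hosted program argv[argc] is a null pointer, so `ptr_vector_length(argv)` equals `argc`.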


    Yes, I'm being annoyingly pedantic.

    The C representation of the string includes the null character;
    the size is a representational concept so it counts it.

    It is common for C programs to break encapsulation and openly deal with
    that terminating null.

