Forum: >>> Magnum BBS <<<

Implicit String-Literal Concatenation

From Lawrence D'Oliveiro@21:1/5 to All on Sat Feb 24 23:05:47 2024

I like using this for long strings:

fputs
(
"When an uncleft or a bulkbit wins one or more bernstonebits above\n"
"its own, it takes on a backward lading. When it loses one or\n"
"more, it takes on a forward lading. Such a mote is called a\n"
"*farer*, for that the drag between unlike ladings flits it. When\n"
"bernstonebits flit by themselves, it may be as a bolt of\n"
"lightning, a spark off some faststanding chunk, or the everyday\n"
"flow of bernstoneness through wires.\n",
stdout
);

Of languages that derive ideas from C, only C++ and Python seem to have
kept this. Java, JavaScript and PHP have not, for some reason.
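
For comparison, a minimal sketch of the same feature in Python, one of the languages named above as having kept it. The variable name `text` and the shortened passage are only illustrative.

```python
# Adjacent string literals are merged into one string at compile time.
# The surrounding parentheses let the expression span several source lines.
text = (
    "When an uncleft or a bulkbit wins one or more bernstonebits above\n"
    "its own, it takes on a backward lading.\n"
)

# Nothing is inserted between the pieces: only the explicit "\n" escapes
# produce line breaks in the result.
print(text, end="")
print(text.count("\n"))
```

As in C, the source layout can mirror the output layout without any operator between the pieces.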

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Janis Papanagnou@21:1/5 to Lawrence D'Oliveiro on Sun Feb 25 17:38:38 2024

On 25.02.2024 00:05, Lawrence D'Oliveiro wrote:

I like using this for long strings:

fputs
(
"When an uncleft or a bulkbit wins one or more bernstonebits above\n"
"its own, it takes on a backward lading. When it loses one or\n"
"more, it takes on a forward lading. Such a mote is called a\n"
"*farer*, for that the drag between unlike ladings flits it. When\n"
"bernstonebits flit by themselves, it may be as a bolt of\n"
"lightning, a spark off some faststanding chunk, or the everyday\n"
"flow of bernstoneness through wires.\n",
stdout
);

I also liked to be able to use this feature _in some cases_ in C++.

Not in the given case, though, where I like to more clearly see the
newlines, so I'd prefer cout << "..." << endl

Of languages that derive ideas from C, only C++ and Python seem to have
kept this. Java, JavaScript and PHP have not, for some reason.

In Java you have at least the string concatenation operator + which
is, IMO, pretty good for that line structuring across source lines.

In Awk (another "C like"), string concatenations have no visible
operators so we can for example write print "Hell" "o " "world"
But since lines have a much more restricted definition you cannot
without line continuation escape spread these strings across many
lines. (It's not too bad to add a terminating '\' where desired.)

As far as you're asking "for some reason", I could just speculate
(and abstain).

Janis


From Blue-Maned_Hawk@21:1/5 to All on Sun Feb 25 16:45:20 2024

I've used this to make strings with embedded newlines look in the source
file closer to how they'd look on output.

--
Blue-Maned_Hawk | shortens to Hawk | /blu.mɛin.dʰak/ | he/him/his/himself/Mr. blue-maned_hawk.srht.site
2017 called, but i couldn't understand what they were saying over all the screams.


From Lawrence D'Oliveiro@21:1/5 to Janis Papanagnou on Sun Feb 25 20:43:31 2024

On Sun, 25 Feb 2024 17:38:38 +0100, Janis Papanagnou wrote:

In Java you have at least the string concatenation operator + which is,
IMO, pretty good for that line structuring across source lines.

Implicit concatenation works well in Python because you also have the
“%” operator overloaded to perform printf-style formatting with a
string. If you had to use “+” then, because that binds less tightly
than “%”, you would have to have parentheses as well, which are
unnecessary with implicit concatenation. E.g.

# depreciation entries
sql.cursor.execute \
(
"insert into payments set when_made = %(when_made)s,"
" description = %(description)s, other_party_name = \"\","
" amount = %(amount)d, kind = \"D\", tax_year = %(tax_year)d"
%
{
"when_made" : end_for_tax_year(tax_year) - 1,
"description" :
sql_string
(
"%s: %s $%s at %d%% from %s"
%
(
entry["description"],
entry["method"],
format_amount(entry["initial_value"]),
entry["rate"],
format_date(entry["when_purchased"]),
)
),
"amount" : - entry["amount"],
"tax_year" : tax_year,
}
)
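
The precedence point can be checked in isolation. This is a small sketch with made-up format strings and arguments, separate from the SQL code above: adjacent literals are merged before "%" is applied, whereas with "+" the "%" applies only to the last piece unless parentheses are added.

```python
args = ("when", "amount")  # illustrative values only

# Implicit concatenation: the literals are merged first, then "%" is applied.
implicit = "%s and " "%s" % args

# With "+", "%" binds more tightly, so parentheses are needed.
explicit = ("%s and " + "%s") % args

# Without parentheses, "%" sees only the last literal, which has one
# placeholder for two arguments, so the formatting step fails.
try:
    "%s and " + "%s" % args
    failed = False
except TypeError:
    failed = True

print(implicit, explicit, failed)
```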

Or, for added fun, how about parameterizing a format:

num_format = "%%.%dg" % nr_digits
...
for axis in range(3) :
    out.write \
      (
        " (%s, %s),\n"
        %
        (num_format, num_format)
        %
        (
            min(v.co[axis] for v in the_mesh.vertices),
            max(v.co[axis] for v in the_mesh.vertices)
        )
      )
#end for
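
The two-stage formatting can be tried on its own. A minimal sketch follows, with an illustrative digit count and sample numbers standing in for the mesh coordinates; `template` and `line` are names introduced here only for the example.

```python
nr_digits = 3  # illustrative value

# Stage 1: "%%" yields a literal "%", and "%d" receives nr_digits.
num_format = "%%.%dg" % nr_digits

# Stage 2: drop the generated format into a template, then apply the result.
template = "(%s, %s)" % (num_format, num_format)
line = template % (1.23456, 7.891)

print(num_format)
print(template)
print(line)
```

Because "%" associates left to right, the chained form `fmt % (a, b) % (x, y)` does the same thing without the intermediate name.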


From Lawrence D'Oliveiro@21:1/5 to Lawrence D'Oliveiro on Sun Feb 25 20:25:09 2024

On Sat, 24 Feb 2024 23:05:47 -0000 (UTC), Lawrence D'Oliveiro wrote:

Of languages that derive ideas from C, only C++ and Python seem to have
kept this. Java, JavaScript and PHP have not, for some reason.

I’d forgotten to check Perl; it doesn’t have implicit concatenation
either.


From bart@21:1/5 to Lawrence D'Oliveiro on Sun Feb 25 21:20:13 2024

On 25/02/2024 20:43, Lawrence D'Oliveiro wrote:

On Sun, 25 Feb 2024 17:38:38 +0100, Janis Papanagnou wrote:

In Java you have at least the string concatenation operator + which is,
IMO, pretty good for that line structuring across source lines.

Implicit concatenation works well in Python because you also have the
“%” operator overloaded to perform printf-style formatting with a
string. If you had to use “+” then, because that binds less tightly
than “%”,

You mean it binds less tightly than implicit concatenation? So that:

"abc" % "def" "ghi" means "abc" % ("def" "ghi")
"abc" % "def" + "ghi" means ("abc" % "def") "ghi"

you would have to have parentheses as well, which are
unnecessary with implicit concatenation. E.g.

# depreciation entries
sql.cursor.execute \
(
"insert into payments set when_made = %(when_made)s,"
" description = %(description)s, other_party_name = \"\","
" amount = %(amount)d, kind = \"D\", tax_year = %(tax_year)d"
%
{
"when_made" : end_for_tax_year(tax_year) - 1,
"description" :
sql_string
(
"%s: %s $%s at %d%% from %s"
%
(
entry["description"],
entry["method"],
format_amount(entry["initial_value"]),
entry["rate"],
format_date(entry["when_purchased"]),
)
),
"amount" : - entry["amount"],
"tax_year" : tax_year,
}
)

Or, for added fun, how about parameterizing a format:

num_format = "%%.%dg" % nr_digits
...
for axis in range(3) :
    out.write \
      (
        " (%s, %s),\n"
        %
        (num_format, num_format)
        %
        (
            min(v.co[axis] for v in the_mesh.vertices),
            max(v.co[axis] for v in the_mesh.vertices)
        )
      )
#end for

Although I can't see it made much difference here. Is this an example of
how bad it can be without implicit concatenation, or is it this
complicated despite that?

Since I can't see any "+" operators between strings, yet what follows
"%" is usually something starting with "(" or "{", not a string constant.


From Lawrence D'Oliveiro@21:1/5 to All on Mon Feb 26 20:31:18 2024

On Mon, 26 Feb 2024 21:12:39 +0100, Łukasz 'Maly' Ostrowski wrote:

Java (Text Blocks):
String s = """
multi line string""";

Python has those, too. I use them sometimes. Generally I’m not fond of
them, because I think they’re wrongly defined.

JavaScript (Template Literal):
let s = `multi line string`;

I think Python has something like that now, too. F-strings?

Still more convenient than C.

I still like having the choice of implicit concatenation, because then I
fully control what appears in the string.
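
A small sketch of what that control amounts to in Python, comparing a triple-quoted string with implicit concatenation. The function names are hypothetical and exist only for the comparison.

```python
def with_triple_quotes():
    # The line break after the opening quotes and the indentation of every
    # following line all become part of the string.
    return """
    first line
    second line
    """

def with_implicit_concatenation():
    # Only what is written between the quotes ends up in the string.
    return (
        "first line\n"
        "second line\n"
    )

print(repr(with_triple_quotes()))
print(repr(with_implicit_concatenation()))
```

The triple-quoted version picks up a leading newline and the source indentation, which then has to be stripped again (for example with `textwrap.dedent`) if it is not wanted.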

Tip: I have Emacs macros defined to strip and add the quoting/escaping,
because I find the strings are easier to edit without that.

PHP? Don't care about PHP, it's shit, not even checking, most likely
some kind of a Perl-ish <<<EOF expression.

PHP is shit, not because of what it copied from Perl, but from what it
didn’t copy. Nowadays it is trying to copy from Python, and it is making
the same mistake.

The <<EOD construct that Perl has comes from POSIX shells, and it is very useful in both places. Bash also adds a <<<-construct.

Question: How would you do two separate <<-strings in the same shell
command?


From Kaz Kylheku@21:1/5 to Lawrence D'Oliveiro on Mon Feb 26 20:42:42 2024

On 2024-02-24, Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

I like using this for long strings:

fputs
(
"When an uncleft or a bulkbit wins one or more bernstonebits above\n"
"its own, it takes on a backward lading. When it loses one or\n"
"more, it takes on a forward lading. Such a mote is called a\n"
"*farer*, for that the drag between unlike ladings flits it. When\n"
"bernstonebits flit by themselves, it may be as a bolt of\n"
"lightning, a spark off some faststanding chunk, or the everyday\n"
"flow of bernstoneness through wires.\n",
stdout
);

Of languages that derive ideas from C, only C++ and Python seem to have
kept this. Java, JavaScript and PHP have not, for some reason.

Implicit string catenation means you need punctuation to separate
elements that are not catenated.

It's a nonstarter in Lisp where you want

("ab" "cd" "ef")

to have three elements, not one. So if it worked that way, we would
need

("ab", "cd", "ef")

which is too horrible a price to pay for string literal catenation.

ANSI Lisp just allows line breaks in strings. However, all the white
space is combined into it.

Allowing line breaks in string literals means that if you forget to
close a quote, it might not be diagnosed until the end of file!
The strictness of having to close a string in the same line is
worthwhile for diagnosis.

In TXR Lisp, I solved multiple problems with a backslash continuation.

"abc \
def"

encodes the string "abcdef". All unescaped whitespace around the
backslash is deleted. If you want "abc def", you can plant an escaped
space in there:

"abc \ \
def"

or

"abc \
\ def"

Unfortunately, it does mean we have the run of backslashes down the
right side:

"abc \
def \
ghi \
... "

I can live with that.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca


From Mike Sanders@21:1/5 to Lawrence D'Oliveiro on Mon Feb 26 22:03:11 2024

Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

I like using this for long strings:

fputs
(
"When an uncleft or a bulkbit wins one or more bernstonebits above\n"
"its own, it takes on a backward lading. When it loses one or\n"
"more, it takes on a forward lading. Such a mote is called a\n"
"*farer*, for that the drag between unlike ladings flits it. When\n"
"bernstonebits flit by themselves, it may be as a bolt of\n"
"lightning, a spark off some faststanding chunk, or the everyday\n"
"flow of bernstoneness through wires.\n",
stdout
);

Of languages that derive ideas from C, only C++ and Python seem to have
kept this. Java, JavaScript and PHP have not, for some reason.

Easy solution Lawrence. Why not use something like bin2c:

<https://www.segger.com/free-utilities/bin2c/>

void Usage() {

#include "my_text"

printf("%s\n", my_var);

}

--
:wq
Mike Sanders


From Lawrence D'Oliveiro@21:1/5 to Mike Sanders on Mon Feb 26 23:17:36 2024

On Mon, 26 Feb 2024 22:03:11 -0000 (UTC), Mike Sanders wrote:

Easy solution Lawrence. Why not use something like bin2c:

My tool for easy editing of such embedded text is the Emacs macros in multiquote.el, here <https://gitlab.com/ldo/emacs-prefs>.


From David Brown@21:1/5 to Mike Sanders on Tue Feb 27 09:36:38 2024

On 26/02/2024 23:03, Mike Sanders wrote:

Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

I like using this for long strings:

fputs
(
"When an uncleft or a bulkbit wins one or more bernstonebits above\n"
"its own, it takes on a backward lading. When it loses one or\n"
"more, it takes on a forward lading. Such a mote is called a\n"
"*farer*, for that the drag between unlike ladings flits it. When\n"
"bernstonebits flit by themselves, it may be as a bolt of\n"
"lightning, a spark off some faststanding chunk, or the everyday\n"
"flow of bernstoneness through wires.\n",
stdout
);

Of languages that derive ideas from C, only C++ and Python seem to have
kept this. Java, JavaScript and PHP have not, for some reason.

Easy solution Lawrence. Why not use something like bin2c:

<https://www.segger.com/free-utilities/bin2c/>

Because it generates files that have Segger copyright notices stamped on
them? At least, that's how it appears from that web page.

There are lots of open source alternatives that do similar things, with different variations in the way they generate the output. Or you can
write your own in about 10 lines of Python, which of course makes it a
lot easier to customise to fit your own styles and requirements.
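
To give an idea of the size of such a script, here is a rough sketch of a converter of that kind. The function name `to_c_array`, the array name `my_text` and the output layout are all assumptions made for the example, not the output format of bin2c or any other existing tool, and the input is assumed to be non-empty.

```python
def to_c_array(data: bytes, name: str, per_line: int = 12) -> str:
    """Render data as a C definition of a const unsigned char array."""
    lines = []
    for start in range(0, len(data), per_line):
        chunk = data[start:start + per_line]
        # One row of hex constants; a trailing comma is legal in a C initializer.
        lines.append("    " + ", ".join("0x%02x" % b for b in chunk) + ",")
    body = "\n".join(lines)
    return (
        "const unsigned char %s[%d] =\n"
        "{\n"
        "%s\n"
        "};\n"
        % (name, len(data), body)
    )

print(to_c_array(b"Hi\n", "my_text"), end="")
```

For a real file, the bytes would come from something like `open(path, "rb").read()`, and the result would be written to a header for `#include`.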

And with C23, we will get #embed, though it is not yet supported by
major tools.

<https://en.cppreference.com/w/c/preprocessor/embed>


From Janis Papanagnou@21:1/5 to Lawrence D'Oliveiro on Tue Feb 27 13:18:20 2024

On 26.02.2024 21:31, Lawrence D'Oliveiro wrote:

The <<EOD construct that Perl has comes from POSIX shells, and it is very useful in both places. Bash also adds a <<<-construct.

Yes, bash adopted the '<<<' "here-strings".

Question: How would you do two separate <<-strings in the same shell
command?

Can you give an example what you intend here? (With what semantics?)

Since '<<' is redirecting the here-document text to stdin of the
command you can have only one channel.

Janis


From Mike Sanders@21:1/5 to David Brown on Tue Feb 27 17:31:36 2024

David Brown <david.brown@hesbynett.no> wrote:

Because it generates files that have Segger copyright notices stamped on them? At least, that's how it appears from that web page.

Then we build our own...

There are lots of open source alternatives that do similar things, with different variations in the way they generate the output. Or you can
write your own in about 10 lines of Python, which of course makes it a
lot easier to customise to fit your own styles and requirements.

Yeah even simpler ways too, sed/awk/etc

And with C23, we will get #embed, though it is not yet supported by
major tools.

<https://en.cppreference.com/w/c/preprocessor/embed>

Did not know that was coming down the pike, thanks for sharing the info
David.

--
:wq
Mike Sanders


From Mike Sanders@21:1/5 to Lawrence D'Oliveiro on Tue Feb 27 17:27:51 2024

Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

My tool for easy editing of such embedded text is the Emacs macros in multiquote.el, here <https://gitlab.com/ldo/emacs-prefs>.

Neato-burritto, built his own tool chain, 'atta-boy. Interesting page.

--
:wq
Mike Sanders


From bart@21:1/5 to David Brown on Tue Feb 27 18:56:26 2024

On 27/02/2024 08:36, David Brown wrote:

On 26/02/2024 23:03, Mike Sanders wrote:

Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

I like using this for long strings:

fputs
(
"When an uncleft or a bulkbit wins one or more bernstonebits above\n"
"its own, it takes on a backward lading. When it loses one or\n"
"more, it takes on a forward lading. Such a mote is called a\n"
"*farer*, for that the drag between unlike ladings flits it. When\n"
"bernstonebits flit by themselves, it may be as a bolt of\n"
"lightning, a spark off some faststanding chunk, or the everyday\n"
"flow of bernstoneness through wires.\n",
stdout
);

Of languages that derive ideas from C, only C++ and Python seem to have
kept this. Java, JavaScript and PHP have not, for some reason.

Easy solution Lawrence. Why not use something like bin2c:

<https://www.segger.com/free-utilities/bin2c/>

Because it generates files that have Segger copyright notices stamped on them? At least, that's how it appears from that web page.

There are lots of open source alternatives that do similar things, with different variations in the way they generate the output. Or you can
write your own in about 10 lines of Python, which of course makes it a
lot easier to customise to fit your own styles and requirements.

And with C23, we will get #embed, though it is not yet supported by
major tools.

<https://en.cppreference.com/w/c/preprocessor/embed>

Actually I've had such feature, for text files, for some years in my
older compiler:

#include <stdio.h>

int main(void) {
puts(strinclude(__FILE__));
}

This prints out the contents of this sourcefile. Binary files don't work because of embedded zeros, but could have been made to.

Some stuff is just very easy to do; other stuff like designator chains
less easy and also less useful.


From Lawrence D'Oliveiro@21:1/5 to David Brown on Tue Feb 27 20:25:27 2024

On Tue, 27 Feb 2024 09:36:38 +0100, David Brown wrote:

And with C23, we will get #embed, though it is not yet supported by
major tools.

More and more hacks on the preprocessor. Why not just get rid of it and
replace it with something like m4?

Because then you will discover that string-based macros are inherently an unmanageable problem.


From bart@21:1/5 to Lawrence D'Oliveiro on Tue Feb 27 22:12:23 2024

On 27/02/2024 20:25, Lawrence D'Oliveiro wrote:

On Tue, 27 Feb 2024 09:36:38 +0100, David Brown wrote:

And with C23, we will get #embed, though it is not yet supported by
major tools.

More and more hacks on the preprocessor. Why not just get rid of it and replace it with something like m4?

Because then you will discover that string-based macros are inherently an unmanageable problem.

I hadn't noticed that #embed was a preprocessor directive. But that is
not the problem here, it is this:

"The expansion of a #embed directive is a token sequence formed from the
list of integer constant expressions described below."

If a string like "ABC" really is converted to the five tokens 'A' comma
'B' comma 'C', then it's going to make long strings and binary files inefficient.

Embedding a 100KB file will result in a 100KB bigger executable, but
along the way it may have to generate 200,000 tokens within the
compiler, half of them commas. Which in turn will need to be turned into 100,000 integer expressions.

I would hope that implementations find some way of streamlining that
process, perhaps by turning that 100KB of data directly into a 100KB string.
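
The arithmetic can be modelled in a few lines. This sketch only mimics the logical expansion quoted above (one integer constant per byte, separated by commas); it says nothing about how a real compiler implements #embed, and `embed_tokens` is a hypothetical helper.

```python
def embed_tokens(data: bytes) -> list:
    """Model the logical expansion: one integer per byte, separated by commas."""
    tokens = []
    for i, byte in enumerate(data):
        if i > 0:
            tokens.append(",")
        tokens.append(str(byte))
    return tokens

print(embed_tokens(b"ABC"))
print(len(embed_tokens(bytes(100_000))))
```

For n bytes the model gives n integers and n - 1 commas, so 199,999 tokens for a file of 100,000 bytes.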


From David Brown@21:1/5 to bart on Tue Feb 27 23:21:28 2024

On 27/02/2024 19:56, bart wrote:

On 27/02/2024 08:36, David Brown wrote:

On 26/02/2024 23:03, Mike Sanders wrote:

Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

I like using this for long strings:

fputs
(
"When an uncleft or a bulkbit wins one or more bernstonebits above\n"
"its own, it takes on a backward lading. When it loses one or\n"
"more, it takes on a forward lading. Such a mote is called a\n"
"*farer*, for that the drag between unlike ladings flits it. When\n"
"bernstonebits flit by themselves, it may be as a bolt of\n"
"lightning, a spark off some faststanding chunk, or the everyday\n"
"flow of bernstoneness through wires.\n",
stdout
);

Of languages that derive ideas from C, only C++ and Python seem to have
kept this. Java, JavaScript and PHP have not, for some reason.

Easy solution Lawrence. Why not use something like bin2c:

<https://www.segger.com/free-utilities/bin2c/>

Because it generates files that have Segger copyright notices stamped
on them? At least, that's how it appears from that web page.

There are lots of open source alternatives that do similar things,
with different variations in the way they generate the output. Or you
can write your own in about 10 lines of Python, which of course makes
it a lot easier to customise to fit your own styles and requirements.

And with C23, we will get #embed, though it is not yet supported by
major tools.

<https://en.cppreference.com/w/c/preprocessor/embed>

Actually I've had such feature, for text files, for some years in my
older compiler:

    #include <stdio.h>

    int main(void) {
        puts(strinclude(__FILE__));
    }

This prints out the contents of this sourcefile. Binary files don't work because of embedded zeros, but could have been made to.

Some stuff is just very easy to do; other stuff like designator chains
less easy and also less useful.

The #embed pre-processor directive turns the file into a list of integer constants, one per byte (unless an implementation offers other options).
That makes it a little less convenient for strings than your solution,
but more convenient for data files. There's no harm in supporting both!


From Lawrence D'Oliveiro@21:1/5 to Janis Papanagnou on Tue Feb 27 23:10:17 2024

On Tue, 27 Feb 2024 13:18:20 +0100, Janis Papanagnou wrote:

On 26.02.2024 21:31, Lawrence D'Oliveiro wrote:

Question: How would you do two separate <<-strings in the same shell
command?

Can you give an example what you intend here? (With what semantics?)

Since '<<' is redirecting the here-document text to stdin of the command
you can have only one channel.

Perl lets you do something like

func(<<EOD1, <<EOD2);
... contents of first string ...
EOD1
... contents of second string ...
EOD2

But this doesn’t work in Bash. However, in a Posix shell, remember you can specify the number of the file descriptor you want to redirect, e.g.

diff -u /dev/fd/8 /dev/fd/9 8<<'EOD1' 9<<'EOD2'
... contents of first string ...
EOD1
... contents of second string ...
EOD2

Note I add the single quotes to prevent expansion of “$”-sequences within the strings. (I think this might be needed in Perl, too.)


From Lawrence D'Oliveiro@21:1/5 to Keith Thompson on Tue Feb 27 23:03:08 2024

On Tue, 27 Feb 2024 12:35:35 -0800, Keith Thompson wrote:

(m4? Seriously?)

Do you know of any more powerful string-based macro processor?

The C preprocessor operates on preprocessor tokens, not just strings.

Think of “hygienic” macros in the Lisps, and why that is just impossible
in any string-based preprocessor.


From Lawrence D'Oliveiro@21:1/5 to David Brown on Tue Feb 27 22:52:33 2024

On Tue, 27 Feb 2024 23:21:28 +0100, David Brown wrote:

The #embed pre-processor directive turns the file into a list of integer constants, one per byte (unless an implementation offers other options).

What a waste of time.


From Janis Papanagnou@21:1/5 to Lawrence D'Oliveiro on Wed Feb 28 00:50:46 2024

XPost: comp.unix.shell

On 28.02.2024 00:10, Lawrence D'Oliveiro wrote:

On Tue, 27 Feb 2024 13:18:20 +0100, Janis Papanagnou wrote:

On 26.02.2024 21:31, Lawrence D'Oliveiro wrote:

Question: How would you do two separate <<-strings in the same shell
command?

Can you give an example what you intend here? (With what semantics?)

Since '<<' is redirecting the here-document text to stdin of the command
you can have only one channel.

Perl lets you do something like

func(<<EOD1, <<EOD2);
... contents of first string ...
EOD1
... contents of second string ...
EOD2

But this doesn’t work in Bash. However, in a Posix shell, remember you can specify the number of the file descriptor you want to redirect, e.g.

diff -u /dev/fd/8 /dev/fd/9 8<<'EOD1' 9<<'EOD2'
... contents of first string ...
EOD1
... contents of second string ...
EOD2

Note I add the single quotes to prevent expansion of “$”-sequences within the strings. (I think this might be needed in Perl, too.)

I see. - Yes, you can do that in POSIX shells as well. - Note that I set F'up-to CUS. And post the response there as a f'up to this post.

Janis


From Kaz Kylheku@21:1/5 to Lawrence D'Oliveiro on Wed Feb 28 01:09:46 2024

On 2024-02-27, Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

On Tue, 27 Feb 2024 23:21:28 +0100, David Brown wrote:

The #embed pre-processor directive turns the file into a list of integer
constants, one per byte (unless an implementation offers other options).

What a waste of time.

Plus easily doable in 1970's Lisp.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca


From David Brown@21:1/5 to bart on Wed Feb 28 12:54:06 2024

On 27/02/2024 23:12, bart wrote:

On 27/02/2024 20:25, Lawrence D'Oliveiro wrote:

On Tue, 27 Feb 2024 09:36:38 +0100, David Brown wrote:

And with C23, we will get #embed, though it is not yet supported by
major tools.

More and more hacks on the preprocessor. Why not just get rid of it and
replace it with something like m4?

Because then you will discover that string-based macros are inherently an
unmanageable problem.

I hadn't noticed that #embed was a preprocessor directive. But that is
not the problem here, it is this:

"The expansion of a #embed directive is a token sequence formed from the
list of integer constant expressions described below."

If a string like "ABC" really is converted to the five tokens 'A' comma
'B' comma 'C', then it's going to make long strings and binary files inefficient.

Embedding a 100KB file will result in a 100KB bigger executable, but
along the way it may have to generate 200,000 tokens within the
compiler, half of them commas. Which in turn will need to be turned into 100,000 integer expressions.

I would hope that implementations find some way of streamlining that
process, perhaps by turning that 100KB of data directly into a 100KB
string.

They won't use strings, they will use data blobs - binary data. Then
there is no issue with null bytes. And yes, implementations will skip
the token generation (unless you are doing something weird, such as
using #embed to read the parameters to a function call).

Tests with prototype implementations gave extremely fast results.


From David Brown@21:1/5 to Kaz Kylheku on Wed Feb 28 12:50:10 2024

On 28/02/2024 02:09, Kaz Kylheku wrote:

On 2024-02-27, Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

On Tue, 27 Feb 2024 23:21:28 +0100, David Brown wrote:

The #embed pre-processor directive turns the file into a list of integer
constants, one per byte (unless an implementation offers other options).

What a waste of time.

Plus easily doable in 1970's Lisp.

That would be useful, if we were living in the 1970's or if anyone had
wanted to learn Lisp this side of the millennium bug.

As I mentioned before, it's not particularly difficult to do this kind
of manipulation, and people write utilities for them in a variety of
languages, or download a variety of free tools for the job.

But it will often be more convenient to have it built into the language
and compiler. And for those interested in speed, the test
implementations have handled this far more efficiently than other
techniques. Logically, #embed turns the file into a list of numbers. In practice, if you use it for the common case of initialising a const
array of unsigned char, the compiler simply copies and pastes the file
into the output as a binary blob.

It would, IMHO, have been useful also to have had an "embed operator" in
the manner of the "pragma operator", so that it could be used in a macro definition.


From bart@21:1/5 to David Brown on Wed Feb 28 13:13:13 2024

On 28/02/2024 11:54, David Brown wrote:

On 27/02/2024 23:12, bart wrote:

On 27/02/2024 20:25, Lawrence D'Oliveiro wrote:

On Tue, 27 Feb 2024 09:36:38 +0100, David Brown wrote:

And with C23, we will get #embed, though it is not yet supported by
major tools.

More and more hacks on the preprocessor. Why not just get rid of it and
replace it with something like m4?

Because then you will discover that string-based macros are
inherently an
unmanageable problem.

I hadn't noticed that #embed was a preprocessor directive. But that is
not the problem here, it is this:

"The expansion of a #embed directive is a token sequence formed from
the list of integer constant expressions described below."

If a string like "ABC" really is converted to the five tokens 'A'
comma 'B' comma 'C', then it's going to make long strings and binary
files inefficient.

Embedding a 100KB file will result in a 100KB bigger executable, but
along the way it may have to generate 200,000 tokens within the
compiler, half of them commas. Which in turn will need to be turned
into 100,000 integer expressions.

I would hope that implementations find some way of streamlining that
process, perhaps by turning that 100KB of data directly into a 100KB
string.

They won't use strings, they will use data blobs - binary data. Then
there is no issue with null bytes.

AFAIK strings in C can have embedded zeros when not assumed to be zero-terminated. So here:

char s[]={1,2,3,0,4,5,6};

s will have a length of 7.

And yes, implementations will skip
the token generation (unless you are doing something weird, such as
using #embed to read the parameters to a function call).

What happens if you do -E to preprocess only?


From David Brown@21:1/5 to bart on Wed Feb 28 15:08:52 2024

On 28/02/2024 14:13, bart wrote:

On 28/02/2024 11:54, David Brown wrote:

On 27/02/2024 23:12, bart wrote:

On 27/02/2024 20:25, Lawrence D'Oliveiro wrote:

On Tue, 27 Feb 2024 09:36:38 +0100, David Brown wrote:

And with C23, we will get #embed, though it is not yet supported by
major tools.

More and more hacks on the preprocessor. Why not just get rid of it and
replace it with something like m4?

Because then you will discover that string-based macros are
inherently an
unmanageable problem.

I hadn't noticed that #embed was a preprocessor directive. But that is
not the problem here, it is this:

"The expansion of a #embed directive is a token sequence formed from
the list of integer constant expressions described below."

If a string like "ABC" really is converted to the five tokens 'A'
comma 'B' comma 'C', then it's going to make long strings and binary
files inefficient.

Embedding a 100KB file will result in a 100KB bigger executable, but
along the way it may have to generate 200,000 tokens within the
compiler, half of them commas. Which in turn will need to be turned
into 100,000 integer expressions.

I would hope that implementations find some way of streamlining that
process, perhaps by turning that 100KB of data directly into a 100KB
string.

They won't use strings, they will use data blobs - binary data. Then
there is no issue with null bytes.

AFAIK strings in C can have embedded zeros when not assumed to be zero-terminated. So here:

char s[]={1,2,3,0,4,5,6};

s will have a length of 7.

That's not a string, it's an array of char. A "string" in C is "a
contiguous sequence of characters terminated by and including the first
null character". The difference is crucial in respect to the handling
of null bytes. And it is the main reason for #embed generating a comma-separated sequence of integer constants rather than a string. (It
also avoids messy hex character sequences if you show the output of
#embed somewhere.)
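
The distinction can be observed from Python through the standard ctypes module, which exposes both views of a C char array. This is only an illustration of the array-versus-string point, using the seven bytes from the example above.

```python
import ctypes

# A 7-byte char array with a zero in the middle,
# mirroring char s[] = {1,2,3,0,4,5,6};
s = (ctypes.c_char * 7).from_buffer_copy(bytes([1, 2, 3, 0, 4, 5, 6]))

print(ctypes.sizeof(s))  # size of the array object
print(len(s.raw))        # every byte, including the embedded zero
print(len(s.value))      # bytes before the first null, the C-string view
```

The array holds all seven bytes, while the string view stops at the first null byte.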

And yes, implementations will skip the token generation (unless you
are doing something weird, such as using #embed to read the parameters
to a function call).

What happens if you do -E to preprocess only?

That's something weird :-)

I guess you get the integer list.


From Lawrence D'Oliveiro@21:1/5 to David Brown on Wed Feb 28 20:56:28 2024

On Wed, 28 Feb 2024 12:50:10 +0100, David Brown wrote:

... people write utilities for them in a variety of languages ...

But it will often be more convenient to have it built into the language
and compiler.

What can be built into the language can only ever be a small subset of
the many and varied ways that people have incorporated data blobs into
their programs. Often these will need to have custom structures with
computed header fields, that kind of thing. So you will need custom
build tools to construct these structures, and then you might as well
include those blobs directly into the final build, rather than go
through some extra step of pretending to turn them back into some
source form.

For example, here’s an old Android project of mine (OK, so the app is
Java code, but the same principle applies) <https://bitbucket.org/ldo17/unicode_browser_android/src/master/>
where I wrote a custom Python script to read a Nameslist.txt file
downloaded from unicode.org to generate a table which could be loaded
into memory quickly for easy searching.


From bart@21:1/5 to Lawrence D'Oliveiro on Wed Feb 28 21:34:14 2024

On 28/02/2024 20:56, Lawrence D'Oliveiro wrote:

On Wed, 28 Feb 2024 12:50:10 +0100, David Brown wrote:

... people write utilities for them in a variety of languages ...

But it will often be more convenient to have it built into the language
and compiler.

What can be built into the language can only ever be a small subset of
the many and varied ways that people have incorporated data blobs into
their programs. Often these will need to have custom structures with
computed header fields, that kind of thing. So you will need custom
build tools to construct these structures, and then you might as well
include those blobs directly into the final build, rather than go
through some extra step of pretending to turn them back into some
source form.

For example, here’s an old Android project of mine (OK, so the app is
Java code, but the same principle applies) <https://bitbucket.org/ldo17/unicode_browser_android/src/master/>
where I wrote a custom Python script to read a Nameslist.txt file
downloaded from unicode.org to generate a table which could be loaded
into memory quickly for easy searching.

I can see now where you get your coding style from. You seem to like
stretching things out vertically as much as possible:

public void Add
  (
    int CategoryCode,
    ItemType Item
  )
  /* Use this instead of add to populate CodeToIndex table. */
  {
    CodeToIndex.put(CategoryCode, getCount());
    add(Item);
  } /*Add*/

In C:

void Add(int CategoryCode, ItemType Item) {
    CodeToIndex_put(CategoryCode, getCount());
    add(Item);
}

4 non-comment lines versus 9. I know Java needs tons of boilerplate, but
it is not all the language's fault.


From bart@21:1/5 to Keith Thompson on Wed Feb 28 23:01:22 2024

On 28/02/2024 21:57, Keith Thompson wrote:

David Brown <david.brown@hesbynett.no> writes:
[...]

They won't use strings, they will use data blobs - binary data. Then
there is no issue with null bytes. And yes, implementations will skip
the token generation (unless you are doing something weird, such as
using #embed to read the parameters to a function call).

Tests with prototype implementations gave extremely fast results.

I'm not sure how that would work. #embed is a preprocessor directive,
and at least in the abstract model it has to expand to valid C code.

I would have expected that it would simply generate the list of comma-separated integer constants described in the standard; later
phases would simply parse that list and generate code as if that
sequence had been written in the original source file. Do you know of
an implementation that does something else?

For example, say you have a file "foo.dat" containing 4 bytes with
values 0, 1, 2, and 3. This would be perfectly valid:

struct foo {
    unsigned char a;
    unsigned short b;
    unsigned int c;
    double d;
};

struct foo obj = {
#embed "foo.dat"
};

It would be unfortunate if your example was allowed. Clearly a binary representation of an instance of your struct would probably require 16
bytes rather than 4, of which one may be padding.

Certainly if you were to write it out to disk as binary, it would need
more than 4.

#embed isn't defined to translate an input file to a sequence of bytes.
It's defined to translate an input file to a sequence of integer
constant expressions.
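To make the quoted point concrete without needing a C23 compiler, here is a rough sketch in which a hand-written macro (FOO_DAT_EXPANSION, a made-up name) stands in for what #embed "foo.dat" would expand to for a file holding the bytes 0, 1, 2, 3. It is an assumption-laden illustration, not output from any real implementation.

```c
#include <assert.h>

struct foo {
    unsigned char a;
    unsigned short b;
    unsigned int c;
    double d;
};

/* Hand-written stand-in for the expansion of #embed "foo.dat":
   four integer constant expressions, one per byte in the file. */
#define FOO_DAT_EXPANSION 0, 1, 2, 3

/* Each constant initialises one member, in order, with the usual
   conversions; the 4 file bytes are not copied as a memory image. */
static const struct foo obj = { FOO_DAT_EXPANSION };

/* Holds on mainstream platforms: the object is larger than the file. */
static_assert(sizeof(struct foo) > 4, "struct foo is bigger than 4 bytes");

/* Returns 1 if the members received the values 0, 1, 2 and 3. */
int members_match(void) {
    return obj.a == 0 && obj.b == 1 && obj.c == 2 && obj.d == 3.0;
}
```

So the file supplies four values, not four bytes of object representation, which is why the struct's size and padding do not enter into it.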

Maybe it should be defined exactly like that, because that is what
people might expect. Your example is better off using a normal text file
which contains an actual comma-delimited list (and which can mix ints
and floats), and a regular #include.


From Lawrence D'Oliveiro@21:1/5 to bart on Wed Feb 28 23:52:55 2024

On Wed, 28 Feb 2024 21:34:14 +0000, bart wrote:

In C:

void Add(int CategoryCode, ItemType Item) {
    CodeToIndex_put(CategoryCode, getCount());
    add(Item);
}

4 non-comment lines versus 9. I know Java needs tons of boilerplate, but
it is not all the language's fault.

Or how about

void Add(int CategoryCode, ItemType Item) {CodeToIndex_put(CategoryCode, getCount());add(Item);}

Wow! I never realized you could do that in C!! I thought it was an
error to put stuff after column 72 or something. Thanks for the tip!!!


From bart@21:1/5 to Keith Thompson on Thu Feb 29 00:47:25 2024

On 28/02/2024 23:31, Keith Thompson wrote:

bart <bc@freeuk.com> writes:

It would be unfortunate if your example was allowed. Clearly a binary
representation of an instance of your struct would probably require 16
bytes rather than 4, of which one may be padding.

Depending on the sizes and alignments of the various types, sure.
So what?

If you have suggestions for alternate ways to define #embed, they might
be interesting, but it's too late to change the existing specification.

My early comments on this were about compiler performance. I suggested
there might be a way to turn 100,000 byte values in a file, directly
into a 100KB string or data block, without needing to first convert
100,000 values into 100,000 integer expressions represented as tokens,
and to then parse those 100,000 expressions into AST nodes etc.

DB suggested something like that was actually done. But you can't do
that if those 100,000 numbers represent from 100KB to 800KB of memory
depending on the data type of the structure they're initialising.

They might even be mixed type. Or it might be an example like this:

A binary file contains 8 bytes representing one IEEE754 float value. It
is desired to use that to initialise a double array of one element.

However #embed will turn that into 8 integer values of 0 to 255 each (I assume).

It's not clear either what happens when one of the integers has the
value 150, say, but it is used to initialise an element of type (signed)
char. It sounds like it would make it hard to initialise a char[] array,
when char is signed, from a file of UTF8 text.
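One common way around the signedness question is to keep file bytes in unsigned char and convert only when a 0..255 value is needed. The sketch below uses hand-written byte values as a stand-in for embedded data; the names utf8_bytes, byte_value and count_non_ascii are invented for illustration, and it assumes 8-bit bytes.

```c
#include <assert.h>
#include <stddef.h>

/* UTF-8 encoding of U+00E9 followed by '!': 0xC3 0xA9 0x21.
   Written by hand as a stand-in for an embedded text file. The first
   two bytes are above 127, so they do not fit in a signed 8-bit char. */
static const unsigned char utf8_bytes[] = { 0xC3, 0xA9, 0x21 };

static_assert(sizeof utf8_bytes == 3, "three bytes of sample data");

/* Recover the 0..255 value of a byte however plain char is signed:
   conversion to unsigned char is well defined (it wraps modulo 256
   when bytes are 8 bits wide). */
unsigned byte_value(char c) {
    return (unsigned char)c;
}

/* Count bytes with the high bit set, i.e. bytes that belong to
   multi-byte UTF-8 sequences. */
size_t count_non_ascii(const unsigned char *p, size_t n) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (p[i] >= 0x80)
            count++;
    return count;
}
```

Declaring the destination as unsigned char sidesteps the question of what 150 means for a signed char; whether a given compiler warns or converts silently in the signed case is up to that implementation.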

Basically, #embed is dumb.

For flexibility, I wouldn't use #embed at all. Just have an actual comma-separated set of values in a text file, and use #include instead.


From bart@21:1/5 to Lawrence D'Oliveiro on Thu Feb 29 00:15:17 2024

On 28/02/2024 23:52, Lawrence D'Oliveiro wrote:

On Wed, 28 Feb 2024 21:34:14 +0000, bart wrote:

In C:

void Add(int CategoryCode, ItemType Item) {
    CodeToIndex_put(CategoryCode, getCount());
    add(Item);
}

4 non-comment lines versus 9. I know Java needs tons of boilerplate, but
it is not all the language's fault.

Or how about

void Add(int CategoryCode, ItemType Item) {CodeToIndex_put(CategoryCode, getCount());add(Item);}

Wow! I never realized you could do that in C!! I thought it was an
error to put stuff after column 72 or something. Thanks for the tip!!!

Well, you could write an entire program on one line.

Or you can write an entire program in one thin column:

v\
o\
i\
d\
....

Or you can use common sense and avoid writing code which is either
too compact or so spread out vertically that you have to hunt for the
actual code. Like trying to find the bits of meat in a thin soup.

That's what I took away from your Java code, which looks remarkably like
the spaced-out examples you posted recently.


From Lawrence D'Oliveiro@21:1/5 to bart on Thu Feb 29 02:53:33 2024

On Thu, 29 Feb 2024 00:15:17 +0000, bart wrote:

Or you can use common sense and avoid writing code which is either
too compact or so spread out vertically that you have to hunt for the
actual code. Like trying to find the bits of meat in a thin soup.

Terribly sorry about that. I wonder if you could look at this part of the
same code file:

final android.util.SparseArray<Integer> CodeToIndex =
new android.util.SparseArray<Integer>();

and show me how to thicken that part of my humble, tasteless gruel? Maybe
using that same “_” trick you used to do OO in C in your previous example?


From David Brown@21:1/5 to Lawrence D'Oliveiro on Thu Feb 29 08:58:40 2024

On 28/02/2024 21:56, Lawrence D'Oliveiro wrote:

On Wed, 28 Feb 2024 12:50:10 +0100, David Brown wrote:

... people write utilities for them in a variety of languages ...

But it will often be more convenient to have it built into the language
and compiler.

What can be built into the language can only ever be a small subset of
the many and varied ways that people have incorporated data blobs into
their programs.

Of course. But that doesn't mean that a language should not include a
feature that makes it easy for a lot of people to get some data blobs
into their code. Maybe /you/ won't find it useful, but other people will.


From David Brown@21:1/5 to Keith Thompson on Thu Feb 29 10:10:10 2024

On 28/02/2024 22:57, Keith Thompson wrote:

David Brown <david.brown@hesbynett.no> writes:
[...]

They won't use strings, they will use data blobs - binary data. Then
there is no issue with null bytes. And yes, implementations will skip
the token generation (unless you are doing something weird, such as
using #embed to read the parameters to a function call).

Tests with prototype implementations gave extremely fast results.

I'm not sure how that would work. #embed is a preprocessor directive,
and at least in the abstract model it has to expand to valid C code.

I would have expected that it would simply generate the list of comma-separated integer constants described in the standard; later
phases would simply parse that list and generate code as if that
sequence had been written in the original source file. Do you know of
an implementation that does something else?

The key thing, as I understand it, is that the compiler gets to know
that the integers in the list are all "nice". And since the
preprocessor and the compiler are part of the same implementation (even
if they are separate programs communicating with pipes or temporary
files), the preprocessor could pass on the binary blob in a pre-parsed form.

Think about what a preprocessor and compiler does with the initialisers
in an array, written in normal text (such as by using "xxd -i" or
another external script). For each integer, it has to divide up the
tokens, identify the comma, parse the integer, check that it is a valid integer, figure out its type based on the size (and suffix, if any). It
needs to record the line number and column number for possible later
reference in error or warning messages. It has to check the value of
the integer against the type for the array elements, and possibly change
the value to suit, or issue warnings for out-of-range values. It has to allocate all the space to store this information as it goes along,
without knowing the size of the array - so it will be lots of small
mallocs and/or wasted space. It's a /lot/. (Simpler compilers can get
away with a bit less effort, especially if they have more limited warnings.)

With #embed, the preprocessor can generate a compiler-specific "start of
embed" informational directive (much like "#line" directives and such
things generated by preprocessors today), then the data in a very
specific format, then an "end of embed" directive. It could, for
example, generate all the integers in the format "0xAB, " with 16
elements per line. The compiler wouldn't need to parse the data
normally - it knows exactly how many elements there are (from the "start
of embed" directive), it knows exactly where to find each entry (as each
is 6 characters long), it only needs to look at two of these characters, there's never any errors, the source line number is fixed (at the #embed
line), and so on.
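The fixed-width idea can be sketched in a few lines. This is purely illustrative: the layout ("0xAB, " entries, 16 per line) is the hypothetical one from the paragraph above, not something any real compiler is known to emit, and the function names are invented. It assumes ASCII, where 'A'..'F' are contiguous.

```c
#include <assert.h>
#include <stddef.h>

enum {
    ENTRY_WIDTH = 6,                   /* "0xAB, " */
    ENTRIES_PER_LINE = 16,
    LINE_WIDTH = ENTRY_WIDTH * ENTRIES_PER_LINE + 1   /* plus newline */
};

static_assert(LINE_WIDTH == 97, "16 entries of 6 chars plus a newline");

/* Number of text characters produced for count bytes (no terminator). */
size_t encoded_size(size_t count) {
    return count * ENTRY_WIDTH + count / ENTRIES_PER_LINE;
}

/* Preprocessor side: write count bytes as fixed-width entries.
   out must have room for encoded_size(count) + 1 chars. */
void encode_fixed(const unsigned char *data, size_t count, char *out) {
    static const char hex[] = "0123456789ABCDEF";
    size_t pos = 0;
    for (size_t i = 0; i < count; i++) {
        out[pos++] = '0';
        out[pos++] = 'x';
        out[pos++] = hex[data[i] >> 4];
        out[pos++] = hex[data[i] & 0x0F];
        out[pos++] = ',';
        out[pos++] = ' ';
        if (i % ENTRIES_PER_LINE == ENTRIES_PER_LINE - 1)
            out[pos++] = '\n';
    }
    out[pos] = '\0';
}

/* Value of one upper-case hex digit (ASCII assumed). */
static int hex_digit(char c) {
    return (c >= '0' && c <= '9') ? c - '0' : c - 'A' + 10;
}

/* Compiler side: no tokenising. Entry i sits at a computable offset
   and only two of its six characters are inspected. */
void decode_fixed(const char *text, size_t count, unsigned char *out) {
    for (size_t i = 0; i < count; i++) {
        size_t at = (i / ENTRIES_PER_LINE) * LINE_WIDTH
                  + (i % ENTRIES_PER_LINE) * ENTRY_WIDTH;
        out[i] = (unsigned char)(hex_digit(text[at + 2]) * 16
                               + hex_digit(text[at + 3]));
    }
}
```

Because every offset is computed rather than discovered, the decoder needs no lexer state, no per-token allocation and no range checks, which is the saving being described.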

A more tightly coupled preprocessor and compiler can do even better -
for array initialisation, the binary blob could be used directly without
ever generating integer constants or parsing them.

The results of testing are that #embed is /massively/ faster and lower
memory compared to external generators, especially for larger files.
And it gives you the data on-hand for optimisation purposes, unlike
external direct linking of binary blobs. (So you can get the size of
the array, or use values from it as compile-time known values.)

For example, say you have a file "foo.dat" containing 4 bytes with
values 0, 1, 2, and 3. This would be perfectly valid:

struct foo {
    unsigned char a;
    unsigned short b;
    unsigned int c;
    double d;
};

struct foo obj = {
#embed "foo.dat"
};

#embed isn't defined to translate an input file to a sequence of bytes.
It's defined to translate an input file to a sequence of integer
constant expressions.

Yes. But the prime speed (and memory usage) gains are for large files,
and that means array initialisers. That does not conflict with using it
for cases like yours.

*Maybe* a compiler could optimize for the case where it knows that it's
being used to initialize an array of unsigned char, but (a) that would require the preprocessor to have information that normally doesn't exist until later phases, and (b) I'm not convinced it would be worth the
effort.

Look at <https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p1040r6.html#design-practice-speed>.

In those tests, for a 40 MB file gcc #embed is 200 times faster than
"xxd -i" generated files, and takes about 2.5% of the memory. It scales
to 1 GB files. And that's just a proof-of-concept implementation.


From bart@21:1/5 to Lawrence D'Oliveiro on Thu Feb 29 09:20:30 2024

On 29/02/2024 02:53, Lawrence D'Oliveiro wrote:

On Thu, 29 Feb 2024 00:15:17 +0000, bart wrote:

Or you can use common sense and avoid writing code which is either
too compact or so spread out vertically that you have to hunt for the
actual code. Like trying to find the bits of meat in a thin soup.

Terribly sorry about that. I wonder if you could look at this part of the same code file:

final android.util.SparseArray<Integer> CodeToIndex =
new android.util.SparseArray<Integer>();

and show me how to thicken that part of my humble, tasteless gruel? Maybe using that same “_” trick you used to do OO in C in your previous example?

You've shown an example of a piece of meat. In main.java, 70% of the
lines are either blanks or contain only an opening or closing bracket.


From bart@21:1/5 to David Brown on Thu Feb 29 10:18:21 2024

On 29/02/2024 09:10, David Brown wrote:

On 28/02/2024 22:57, Keith Thompson wrote:

*Maybe* a compiler could optimize for the case where it knows that it's
being used to initialize an array of unsigned char, but (a) that would
require the preprocessor to have information that normally doesn't exist
until later phases, and (b) I'm not convinced it would be worth the
effort.

Look at <https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p1040r6.html#design-practice-speed>.

In those tests, for a 40 MB file gcc #embed is 200 times faster than
"xxd -i" generated files, and takes about 2.5% of the memory. It scales
to 1 GB files. And that's just a proof-of-concept implementation.

I've just done my own tests, with a 40MB data file containing random
A..Z letters (so can be processed as a text file).

This was also converted to a 120MB text file containing a list of numbers
("65,66,73,...", 3 characters for each data byte).

Using 'strinclude' in my old C compiler, it took about 1 second to build
this program:

#include <stdio.h>
#include <string.h>

char* s=strinclude("data");

int main(void) {
    printf("%zu\n", strlen(s));
}

(Running it shows '40000000'.) The same test in my language (which has
no intermediate ASM stage) took 0.3 seconds.

Next I tried instead that 120MB text file containing the same data but
as text, initialising a char[] array using #include.

Tcc took 12 seconds. Bcc took 56 seconds (via ASM etc).

gcc got up to about 3GB memory usage then 'cc1' failed trying to
allocate 0.5GB, after about a minute.

Processing long lists of numbers DOES use considerable resources. Bear in
mind that #embed also needs to take binary data and generate tokens,
possibly converting each binary number to text.


From bart@21:1/5 to Keith Thompson on Thu Feb 29 14:31:11 2024

On 28/02/2024 21:36, Keith Thompson wrote:

bart <bc@freeuk.com> writes:
[...]

AFAIK strings in C can have embedded zeros when not assumed to be
zero-terminated. So here:

char s[]={1,2,3,0,4,5,6};

s will have a length of 7.

Strings *by definition* cannot have embedded zeros. A null character terminates a string.

A string literal can have embedded \0 characters, but if you're
suggesting that #embed should expand to a string literal, I can see
several disadvantages and no significant advantages. For one thing, the
data may or may not end with a null character; string literals always
do.

Not here:

char s[] = "ABC";
char t[3] = "DEF";

The "DEF" string doesn't end with a zero.

Is 'string' given a special meaning in the standard?

/That/ would seem to me to be too restrictive. Does this:

char *s;

define a pointer to such a string, or can it be any kind of data? For
example, `char*` is used by the GetOpenFileName WinAPI function for a
/series/ of zero-terminated strings which itself is terminated with two
zero bytes.
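The layout described here, a series of zero-terminated strings ended by an extra zero byte, can be walked with nothing but strlen. A minimal sketch with made-up sample data and a made-up function name; it does not call any Windows API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sample buffer: "dir", "a.txt", "b.txt", then an empty string.
   The literal's own terminating null supplies the second zero byte. */
static const char example[] = "dir\0a.txt\0b.txt\0";

static_assert(sizeof example == 17, "15 letters and dots plus 4 nulls... 17 bytes");

/* Count the strings in a list where each entry is an ordinary C string
   and an empty string marks the end of the list. */
size_t count_entries(const char *list) {
    size_t count = 0;
    while (*list != '\0') {
        list += strlen(list) + 1;   /* skip this string and its null */
        count++;
    }
    return count;
}
```

count_entries(example) gives 3, while strlen(example) gives 3 as well but for a different reason: strlen sees only the first string, "dir". The same char* points at one string or at a list of them depending on what the caller agrees the data means.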

So it is some property that is attributed to the data that will be stored.

I normally use `cstring` or `stringz` outside the language when referring
to a zero-terminated sequence of characters, which implies that
embedded zeros aren't allowed.


From Richard Harnden@21:1/5 to bart on Thu Feb 29 15:22:18 2024

On 29/02/2024 14:31, bart wrote:

On 28/02/2024 21:36, Keith Thompson wrote:

bart <bc@freeuk.com> writes:
[...]

AFAIK strings in C can have embedded zeros when not assumed to be
zero-terminated. So here:

     char s[]={1,2,3,0,4,5,6};

s will have a length of 7.

Strings *by definition* cannot have embedded zeros. A null character
terminates a string.

A string literal can have embedded \0 characters, but if you're
suggesting that #embed should expand to a string literal, I can see
several disadvantages and no significant advantages. For one thing, the
data may or may not end with a null character; string literals always
do.

Not here:

    char s[] = "ABC";
    char t[3] = "DEF";

The "DEF" string doesn't end with a zero.

And is, therefore, not a string.

Is 'string' given a special meaning in the standard?

Yes. Things that work with the strX functions. Which means they are
'\0' terminated.

/That/ would seem to me to be too restrictive. Does this:

   char *s;

define a pointer to such a string, or can it be any kind of data? For

That points to a char. That could be followed by more chars, and if one
of those is a '\0', then it's a string. You know this.

example, `char*` is used by the GetOpenFileName WinAPI function for a /series/ of zero-terminated strings which itself is terminated with two
zero bytes.

That is a windowsism, then.

Why didn't they use the NULL terminated char **argv kind of thing?

So it is some property that is attributed to the data that will be stored.

I normally use `cstring` or `stringz` outside the language when referring
to a zero-terminated sequence of characters, which implies that
embedded zeros aren't allowed.


From David Brown@21:1/5 to Malcolm McLean on Thu Feb 29 16:19:45 2024

On 29/02/2024 12:56, Malcolm McLean wrote:

On 28/02/2024 21:36, Keith Thompson wrote:

bart <bc@freeuk.com> writes:
[...]

AFAIK strings in C can have embedded zeros when not assumed to be
zero-terminated. So here:

char s[]={1,2,3,0,4,5,6};

s will have a length of 7.

Strings *by definition* cannot have embedded zeros. A null character
terminates a string.

C strings. Not strings in other programming languages.

Let me point you to the name of this Usenet group.

And strings in any programming language have either :

1. A string of characters and a terminating null, thus no embedded null characters.
2. A starting length (such as Pascal strings).
3. A fixed size.
4. A more advanced structure.

An array of bytes is not a "string".

And only if you
define "C strings" in a rather restrictive but, to be fair, totally legitimate way. So I wouldn't have put in the asterisks.

The definition of "C string" is given in section 7.1.1p1 of the C
standards. There is only one definition of "C string".


From tTh@21:1/5 to bart on Thu Feb 29 16:29:24 2024

On 2/29/24 01:47, bart wrote:

My early comments on this were about compiler performance. I suggested
there might be a way to turn 100,000 byte values in a file, directly
into a 100KB string or data block, without needing to first convert
100,000 values into 100,000 integer expressions represented as tokens,
and to then parse those 100,000 expressions into AST nodes etc.

But you HAVE to do that if #embed is in the preprocessor,
because its job is to give compilable text to the real
compiler. No other way is possible.

Basically, #embed is dumb.

No.

--
+---------------------------------------------------------------------+
| https://tube.interhacker.space/a/tth/video-channels |
+---------------------------------------------------------------------+


From David Brown@21:1/5 to bart on Thu Feb 29 16:30:05 2024

On 29/02/2024 15:31, bart wrote:

On 28/02/2024 21:36, Keith Thompson wrote:

bart <bc@freeuk.com> writes:
[...]

AFAIK strings in C can have embedded zeros when not assumed to be
zero-terminated. So here:

     char s[]={1,2,3,0,4,5,6};

s will have a length of 7.

Strings *by definition* cannot have embedded zeros. A null character
terminates a string.

A string literal can have embedded \0 characters, but if you're
suggesting that #embed should expand to a string literal, I can see
several disadvantages and no significant advantages. For one thing, the
data may or may not end with a null character; string literals always
do.

Not here:

    char s[] = "ABC";

"ABC" is a "string literal". Once things like concatenation of adjacent strings, macro expansion, etc., is complete, a null character is
appended to it. Then it is used as an initialiser for the array "s".
After initialisation, "s" is an array of 4 chars and contains a string.

(Note - a "string literal" might not be a "string", because string
literals can contain embedded nulls. This is a footnote in 6.4.5
describing string literals.)

    char t[3] = "DEF";

The "DEF" string doesn't end with a zero.

"DEF" is not a "string" - it is a "string literal". It does get a null character appended during translation phase 7. But only the first three characters - therefore not including the null character - get copied to
"t" during the initialisation of "t". "t" is an array of 3 chars, and
it does not contain a string.
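The two initialisations can be compared mechanically. A small sketch, valid as C only (C++ rejects the second initialiser); contains_string is a name invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

char s[] = "ABC";    /* array of 4 chars: 'A', 'B', 'C', '\0' */
char t[3] = "DEF";   /* array of 3 chars: the null is not copied */

static_assert(sizeof s == 4 && sizeof t == 3, "array sizes as described");

/* Returns 1 if the first n bytes of arr include a null character, i.e.
   if the array contains a string starting at arr[0]; 0 otherwise. */
int contains_string(const char *arr, size_t n) {
    return memchr(arr, '\0', n) != NULL;
}
```

contains_string(s, sizeof s) is 1 and contains_string(t, sizeof t) is 0. Passing t to strlen or printf("%s") would read past the end of the array, which is exactly why the distinction matters.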

Is 'string' given a special meaning in the standard?

Yes. See 7.1.1p1.

/That/ would seem to me to be too restrictive. Does this:

   char *s;

define a pointer to such a string, or can it be any kind of data? For
example, `char*` is used by the GetOpenFileName WinAPI function for a
/series/ of zero-terminated strings which itself is terminated with two
zero bytes.

"char *s;" declares a pointer to a char, or a pointer to the start of an
array of char. It is /not/ a string, or a pointer to a string. C
strings are values, and exist at run-time - they are not types. So "s"
can point to a string, or a char (which will be a string if and only if
it is a null character), or an array of chars (which may or may not
contain a string), or it could point to anything else.

So it is some property that is attributed to the data that will be stored.

Yes.

I normally use `cstring` or `stringz` outside the language when referring
to a zero-terminated sequence of characters, which implies that
embedded zeros aren't allowed.

That makes sense. Different languages have different ways of holding
sequences of characters (generic "string" data), so you need to qualify
the term if it is not clear from the context.

But when we are discussing C, and there is no other qualification,
"string" means "C string" - the definition of "string" given in the C standards.


From tTh@21:1/5 to bart on Thu Feb 29 16:34:24 2024

On 2/29/24 11:18, bart wrote:

Using 'strinclude' in my old C compiler, it took about 1 second to build
this program:

#include <stdio.h>
#include <string.h>

char* s=strinclude("data");

int main(void) {
    printf("%zu\n", strlen(s));
}

tth@redlady:~/Desktop$ man strinclude
No manual entry for strinclude
tth@redlady:~/Desktop$

--
+---------------------------------------------------------------------+
| https://tube.interhacker.space/a/tth/video-channels |
+---------------------------------------------------------------------+


From Scott Lurndal@21:1/5 to bart on Thu Feb 29 15:48:51 2024

bart <bc@freeuk.com> writes:

On 28/02/2024 23:52, Lawrence D'Oliveiro wrote:

On Wed, 28 Feb 2024 21:34:14 +0000, bart wrote:

In C:

void Add(int CategoryCode, ItemType Item) {
    CodeToIndex_put(CategoryCode, getCount());
    add(Item);
}

4 non-comment lines versus 9. I know Java needs tons of boilerplate, but
it is not all the language's fault.

Or how about

void Add(int CategoryCode, ItemType Item) {CodeToIndex_put(CategoryCode, getCount());add(Item);}

Wow! I never realized you could do that in C!! I thought it was an
error to put stuff after column 72 or something. Thanks for the tip!!!

Well, you could write an entire program on one line.

int main(int b,char**i){long long n=B,a=I^n,r=(a/b&a)>>4,y=atoi(*++i),_=(((a^n/b)*(y>>T)|y>>S)&r)|(a^r);printf("%.8s\n",(char*)&_);}

(A winner from the obfuscated C contest).


From Janis Papanagnou@21:1/5 to Scott Lurndal on Thu Feb 29 17:03:41 2024

On 29.02.2024 16:48, Scott Lurndal wrote:

int main(int b,char**i){long long n=B,a=I^n,r=(a/b&a)>>4,y=atoi(*++i),_=(((a^n/b)*(y>>T)|y>>S)&r)|(a^r);printf("%.8s\n",(char*)&_);}

What does it do?

What preconditions must be fulfilled or what additions
does it need to compile?

(A winner from the obfuscated C contest).

(Are non-compiling C sources allowed in the contest?)

Janis


From Scott Lurndal@21:1/5 to bart on Thu Feb 29 15:53:04 2024

bart <bc@freeuk.com> writes:

On 28/02/2024 23:31, Keith Thompson wrote:

bart <bc@freeuk.com> writes:

It would be unfortunate if your example was allowed. Clearly a binary
representation of an instance of your struct would probably require 16
bytes rather than 4, of which one may be padding.

Depending on the sizes and alignments of the various types, sure.
So what?

If you have suggestions for alternate ways to define #embed, they might
be interesting, but it's too late to change the existing specification.

My early comments on this were about compiler performance. I suggested
there might be a way to turn 100,000 byte values in a file, directly
into a 100KB string or data block, without needing to first convert
100,000 values into 100,000 integer expressions represented as tokens,
and to then parse those 100,000 expressions into AST nodes etc.

DB suggested something like that was actually done. But you can't do
that if those 100,000 numbers represent from 100KB to 800KB of memory
depending on the data type of the structure they're initialising.

An implementation is free to simply pass a variant (or the directive
itself) of #embed from the pre-processor to the compiler if the programmer isn't using -E, and the compiler could simply copy the embedded file
into the object file directly, without processing it as a series of
integer values. Much like the #file and #line directives passed by
the pre-processor to the compiler.


From Scott Lurndal@21:1/5 to Janis Papanagnou on Thu Feb 29 16:17:44 2024

Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:

On 29.02.2024 16:48, Scott Lurndal wrote:

int main(int b,char**i){long long n=B,a=I^n,r=(a/b&a)>>4,y=atoi(*++i),_=(((a^n/b)*(y>>T)|y>>S)&r)|(a^r);printf("%.8s\n",(char*)&_);}

What does it do?

What preconditions must be fulfilled or what additions
does it need to compile?

(A winner from the obfuscated C contest).

(Are non-compiling C sources allowed in the contest?)

https://www.ioccc.org/years.html

The above is from 'burton'.


From Scott Lurndal@21:1/5 to tTh on Thu Feb 29 16:15:47 2024

tTh <tth@none.invalid> writes:

On 2/29/24 01:47, bart wrote:

My early comments on this were about compiler performance. I suggested
there might be a way to turn 100,000 byte values in a file, directly
into a 100KB string or data block, without needing to first convert
100,000 values into 100,000 integer expressions represented as tokens,
and to then parse those 100,000 expressions into AST nodes etc.

But you HAVE to do that if #embed is in the preprocessor,
because its job is to give compilable text to the real
compiler. No other way is possible.

The standard does not require the preprocessor to be
separate from a 'real' compiler. It's acceptable for an implementation
to implement both in a single executable. Absent -E, the
preprocessor and compiler can cooperate to efficiently handle
#embed without generating parseable C code.


From Janis Papanagnou@21:1/5 to Scott Lurndal on Thu Feb 29 18:12:20 2024

On 29.02.2024 17:17, Scott Lurndal wrote:

Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:

On 29.02.2024 16:48, Scott Lurndal wrote:

int main(int b,char**i){long long n=B,a=I^n,r=(a/b&a)>>4,y=atoi(*++i),_=(((a^n/b)*(y>>T)|y>>S)&r)|(a^r);printf("%.8s\n",(char*)&_);}

What does it do?

What preconditions must be fulfilled or what additions
does it need to compile?

With the link below I see it "needs" a 600+ lines long Makefile.

And I see there's some variable definitions with magic numbers
passed.

(A winner from the obfuscated C contest).

(Are non-compiling C sources allowed in the contest?)

https://www.ioccc.org/years.html

The above is from 'burton'.

Thanks.

Janis

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Janis Papanagnou@21:1/5 to Keith Thompson on Thu Feb 29 18:17:19 2024

On 29.02.2024 17:18, Keith Thompson wrote:

"abc\0def" is a valid string literal, but its value is not a string.
(No, the standard doesn't say that the value of a string literal is a string.)

This sounds somewhat strange in my ears. Usually a literal for a type
will constitute an instance of the type. - I suppose the irregularity
stems from the fact that there's no explicit string object type in C.

Janis

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to Keith Thompson on Thu Feb 29 17:28:33 2024

Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:

scott@slp53.sl.home (Scott Lurndal) writes:

bart <bc@freeuk.com> writes:

On 28/02/2024 23:31, Keith Thompson wrote:

bart <bc@freeuk.com> writes:

It would be unfortunate if your example was allowed. Clearly a binary
representation of an instance of your struct would probably require 16
bytes rather than 4, of which one may be padding.

Depending on the sizes and alignments of the various types, sure.
So what?

If you have suggestions for alternate ways to define #embed, they might
be interesting, but it's too late to change the existing specification.

My early comments on this were about compiler performance. I suggested
there might be a way to turn 100,000 byte values in a file, directly
into a 100KB string or data block, without needing to first convert
100,000 values into 100,000 integer expressions represented as tokens,
and to then parse those 100,000 expressions into AST nodes etc.

DB suggested something like that was actually done. But you can't do
that if those 100,000 numbers represent from 100KB to 800KB of memory
depending on the data type of the structure they're initialising.

An implementation is free to simply pass a variant (or the directive
itself) of #embed from the pre-processor to the compiler if the programmer
isn't using -E, and the compiler could simply copy the embedded file
into the object file directly, without processing it as a series of
integer values. Much like the #file and #line directives passed by
the pre-processor to the compiler.

Sure, an implementation has to operate *as if* it implemented the 8
translation phases separately. But given a structure initialized with
#embed, it would have to generate additional code to initialize the
structure members from the bytes of the binary blob.

Would it? Or could it simply assume that the binary blob
is already in the same binary format that writing an instance
of the structure from a C application on the same host would have created?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to Janis Papanagnou on Thu Feb 29 17:30:00 2024

Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:

On 29.02.2024 17:17, Scott Lurndal wrote:

Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:

On 29.02.2024 16:48, Scott Lurndal wrote:

int main(int b,char**i){long long n=B,a=I^n,r=(a/b&a)>>4,y=atoi(*++i),_=(((a^n/b)*(y>>T)|y>>S)&r)|(a^r);printf("%.8s\n",(char*)&_);}

What does it do?

What preconditions must be fulfilled or what additions
does it need to compile?

With the link below I see it "needs" a 600+ lines long Makefile.

The readme simply says compile it and run it
as ./prog <value between 1 and 512>.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Scott Lurndal on Thu Feb 29 18:58:49 2024

On 29/02/2024 18:28, Scott Lurndal wrote:

Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:

scott@slp53.sl.home (Scott Lurndal) writes:

bart <bc@freeuk.com> writes:

On 28/02/2024 23:31, Keith Thompson wrote:

bart <bc@freeuk.com> writes:

It would be unfortunate if your example was allowed. Clearly a binary
representation of an instance of your struct would probably require 16
bytes rather than 4, of which one may be padding.

Depending on the sizes and alignments of the various types, sure.
So what?

If you have suggestions for alternate ways to define #embed, they might
be interesting, but it's too late to change the existing specification.

My early comments on this were about compiler performance. I suggested
there might be a way to turn 100,000 byte values in a file, directly
into a 100KB string or data block, without needing to first convert
100,000 values into 100,000 integer expressions represented as tokens,
and to then parse those 100,000 expressions into AST nodes etc.

DB suggested something like that was actually done. But you can't do
that if those 100,000 numbers represent from 100KB to 800KB of memory
depending on the data type of the structure they're initialising.

An implementation is free to simply pass a variant (or the directive
itself) of #embed from the pre-processor to the compiler if the programmer
isn't using -E, and the compiler could simply copy the embedded file
into the object file directly, without processing it as a series of
integer values. Much like the #file and #line directives passed by
the pre-processor to the compiler.

Sure, an implementation has to operate *as if* it implemented the 8
translation phases separately. But given a structure initialized with
#embed, it would have to generate additional code to initialize the
structure members from the bytes of the binary blob.

Would it? Or could it simply assume that the binary blob
is already in the same binary format that writing an instance
of the structure from a C application on the same host would have created?

That would depend on the sizes of the fields in the struct, and the size
of the integer constants in the #embed. The norm for #embed will be
unsigned char integer constants, so it will only be a direct fit for the
binary representation of the struct if all the struct fields are
compatible with that. But a compiler could have vendor parameters on
the #embed to change those sizes.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to David Brown on Thu Feb 29 18:05:50 2024

David Brown <david.brown@hesbynett.no> writes:

On 29/02/2024 18:28, Scott Lurndal wrote:

Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:

scott@slp53.sl.home (Scott Lurndal) writes:

bart <bc@freeuk.com> writes:

On 28/02/2024 23:31, Keith Thompson wrote:

bart <bc@freeuk.com> writes:

It would be unfortunate if your example was allowed. Clearly a binary
representation of an instance of your struct would probably require 16
bytes rather than 4, of which one may be padding.

Depending on the sizes and alignments of the various types, sure.
So what?

If you have suggestions for alternate ways to define #embed, they might
be interesting, but it's too late to change the existing specification.

My early comments on this were about compiler performance. I suggested
there might be a way to turn 100,000 byte values in a file, directly
into a 100KB string or data block, without needing to first convert
100,000 values into 100,000 integer expressions represented as tokens,
and to then parse those 100,000 expressions into AST nodes etc.

DB suggested something like that was actually done. But you can't do
that if those 100,000 numbers represent from 100KB to 800KB of memory
depending on the data type of the structure they're initialising.

An implementation is free to simply pass a variant (or the directive
itself) of #embed from the pre-processor to the compiler if the programmer
isn't using -E, and the compiler could simply copy the embedded file
into the object file directly, without processing it as a series of
integer values. Much like the #file and #line directives passed by
the pre-processor to the compiler.

Sure, an implementation has to operate *as if* it implemented the 8
translation phases separately. But given a structure initialized with
#embed, it would have to generate additional code to initialize the
structure members from the bytes of the binary blob.

Would it? Or could it simply assume that the binary blob
is already in the same binary format that writing an instance
of the structure from a C application on the same host would have created?

That would depend on the sizes of the fields in the struct, and the size
of the integer constants in the #embed.

I'm embedding a binary file. I want the representation in memory
to be _exactly_ the same as in the file, regardless of how it is
defined in the C code (array of char, array of int, array of long, struct whatever).

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to Scott Lurndal on Thu Feb 29 18:09:52 2024

scott@slp53.sl.home (Scott Lurndal) writes:

David Brown <david.brown@hesbynett.no> writes:

On 29/02/2024 18:28, Scott Lurndal wrote:

Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:

scott@slp53.sl.home (Scott Lurndal) writes:

An implementation is free to simply pass a variant (or the directive
itself) of #embed from the pre-processor to the compiler if the programmer
isn't using -E, and the compiler could simply copy the embedded file
into the object file directly, without processing it as a series of
integer values. Much like the #file and #line directives passed by
the pre-processor to the compiler.

Sure, an implementation has to operate *as if* it implemented the 8
translation phases separately. But given a structure initialized with
#embed, it would have to generate additional code to initialize the
structure members from the bytes of the binary blob.

Would it? Or could it simply assume that the binary blob
is already in the same binary format that writing an instance
of the structure from a C application on the same host would have created?

That would depend on the sizes of the fields in the struct, and the size
of the integer constants in the #embed.

I'm embedding a binary file. I want the representation in memory
to be _exactly_ the same as in the file, regardless of how it is
defined in the C code (array of char, array of int, array of long, struct whatever).

I have an actual use case today where #embed of a (C++) std::map binary
object created by separate tool would be very useful. I'm
planning on using mmap to load it at runtime at the moment.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Keith Thompson on Thu Feb 29 19:08:22 2024

On 29/02/2024 18:03, Keith Thompson wrote:

David Brown <david.brown@hesbynett.no> writes:

On 28/02/2024 22:57, Keith Thompson wrote:

David Brown <david.brown@hesbynett.no> writes:
[...]

They won't use strings, they will use data blobs - binary data. Then
there is no issue with null bytes. And yes, implementations will skip
the token generation (unless you are doing something weird, such as
using #embed to read the parameters to a function call).

Tests with prototype implementations gave extremely fast results.

I'm not sure how that would work. #embed is a preprocessor
directive,
and at least in the abstract model it has to expand to valid C code.
I would have expected that it would simply generate the list of
comma-separated integer constants described in the standard; later
phases would simply parse that list and generate code as if that
sequence had been written in the original source file. Do you know of
an implementation that does something else?

The key thing, as I understand it, is that the compiler gets to know
that the integers in the list are all "nice". And since the
preprocessor and the compiler are part of the same implementation
(even if they are separate programs communicating with pipes or
temporary files), the preprocessor could pass on the binary blob in a
pre-parsed form.

[...]

Sure, an implementation *could* optimize #embed so it expands to some implementation-defined nonstandard form that later phases can treat as
raw data. But since it's defined as a preprocessor directive, it's
difficult to see how it could do so while covering all cases.

It would require a strong link between the compiler and the preprocessor
- as you know, these don't have to be separate programs. In a more
weakly coupled system, there could still be a method for passing a
binary blob to the compiler in addition to the integer data, and let the compiler use whichever form it preferred (based on what your code does
with the data).

[...]

The results of testing are that #embed is /massively/ faster and lower
memory compared to external generators, especially for larger files.
And it gives you the data on-hand for optimisation purposes, unlike
external direct linking of binary blobs. (So you can get the size of
the array, or use values from it as compile-time known values.)

What testing? The very latest versions of gcc and clang (I checked both their git repos yesterday) do not yet implement #embed.

I believe prototypes, tests, or proofs of concept have been made for
gcc, clang and perhaps other tools. I posted a link to some results -
more are floating around the internet if you want to look for them.

For example, say you have a file "foo.dat" containing 4 bytes with
values 0, 1, 2, and 3. This would be perfectly valid:
struct foo {
unsigned char a;
unsigned short b;
unsigned int c;
double d;
};
struct foo obj = {
#embed "foo.dat"
};
#embed isn't defined to translate an input file to a sequence of
bytes.
It's defined to translate an input file to a sequence of integer
constant expressions.

Yes. But the prime speed (and memory usage) gains are for large
files, and that means array initialisers. That does not
conflict with using it for cases like yours.

So a compiler that does this would have to be able to handle

struct foo obj = {
#blob
<binary data>
#endblob
};

and initialize a, b, c, and d to 0, 1, 2, and 3.0, respectively from successive bytes of the binary data. Either that, or the preprocessor
would have to use information it doesn't have to determine how to expand #embed.

I think I've covered how that could be handled. (And I don't know how
it /will/ be handled. But I am sure compiler implementers will figure a
way to make it work correctly for any use of the integer constant list,
while also making it as efficient as they reasonably can for the common
case of initialising an unsigned char array.)

*Maybe* a compiler could optimize for the case where it knows that it's
being used to initialize an array of unsigned char, but (a) that would
require the preprocessor to have information that normally doesn't exist
until later phases, and (b) I'm not convinced it would be worth the
effort.

Look at
<https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p1040r6.html#design-practice-speed>.

In those tests, for a 40 MB file gcc #embed is 200 times faster than
"xxd -i" generated files, and takes about 2.5% of the memory. It
scales to 1 GB files. And that's just a proof-of-concept
implementation.

That's for std::embed, a proposed C++ feature that's *not* defined as a preprocessor directive. Sample usage from the paper:

constexpr std::span<const std::byte> fxaa_binary =
std::embed( "fxaa.spirv" );

So the compiler knows the type of the object being initialized.

(Note that the author of that C++ paper is also the editor for the C standard.)

The work on #embed is being done simultaneously for C and C++.
std::embed() gives you a slightly different way to write it, but the implementation is the same. (Not unlike _Pragma and #pragma in C.)

Other pages I have seen with speed tests show the same pattern while
referring explicitly to #embed.

I'm still skeptical that C's #embed will actually be implemented other
than as expanding to a sequence of integer constants.

We'll see when it all hits the mainline compilers!

On the other hand, C23 allows for additional implementation-defined parameters to #embed (as well as the standard embed parameters limit,
prefix, suffix, and if_empty). Such a parameter could specify how it's expanded, perhaps to some implementation-defined blob format. *If*
compilers optimize #embed to something other than a sequence of integer constant expressions, that's probably how it would be done. But since neither gcc nor clang implements #embed at all, it may be too early to speculate.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From James Kuyper@21:1/5 to Kaz Kylheku on Thu Feb 29 14:45:44 2024

On 2/29/24 14:26, Kaz Kylheku wrote:

On 2024-02-29, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:

Exactly, "string" is not a type.

It is a type in the broader sense, in that it is a logical proposition
about the attributes of an object that is true or false.

If I defined something to be a sequence of floating point numbers
terminated by a NaN, would that thing also qualify as a type, according
to the definition you're using?

Could you give a source for the definition of "type" that you're using?
Can you use the word "type" in a statement whose truth relies upon the difference between that definition and the way that "type" is defined by
the C standard? Preferably it would be a useful statement that applies to C.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Scott Lurndal on Thu Feb 29 20:51:11 2024

On 29/02/2024 19:05, Scott Lurndal wrote:

David Brown <david.brown@hesbynett.no> writes:

On 29/02/2024 18:28, Scott Lurndal wrote:

Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:

scott@slp53.sl.home (Scott Lurndal) writes:

bart <bc@freeuk.com> writes:

On 28/02/2024 23:31, Keith Thompson wrote:

bart <bc@freeuk.com> writes:

It would be unfortunate if your example was allowed. Clearly a binary
representation of an instance of your struct would probably require 16
bytes rather than 4, of which one may be padding.

Depending on the sizes and alignments of the various types, sure.
So what?

If you have suggestions for alternate ways to define #embed, they might
be interesting, but it's too late to change the existing specification.

My early comments on this were about compiler performance. I suggested
there might be a way to turn 100,000 byte values in a file, directly
into a 100KB string or data block, without needing to first convert
100,000 values into 100,000 integer expressions represented as tokens,
and to then parse those 100,000 expressions into AST nodes etc.

DB suggested something like that was actually done. But you can't do
that if those 100,000 numbers represent from 100KB to 800KB of memory
depending on the data type of the structure they're initialising.

An implementation is free to simply pass a variant (or the directive
itself) of #embed from the pre-processor to the compiler if the programmer
isn't using -E, and the compiler could simply copy the embedded file
into the object file directly, without processing it as a series of
integer values. Much like the #file and #line directives passed by
the pre-processor to the compiler.

Sure, an implementation has to operate *as if* it implemented the 8
translation phases separately. But given a structure initialized with
#embed, it would have to generate additional code to initialize the
structure members from the bytes of the binary blob.

Would it? Or could it simply assume that the binary blob
is already in the same binary format that writing an instance
of the structure from a C application on the same host would have created?

That would depend on the sizes of the fields in the struct, and the size
of the integer constants in the #embed.

I'm embedding a binary file. I want the representation in memory
to be _exactly_ the same as in the file, regardless of how it is
defined in the C code (array of char, array of int, array of long, struct whatever).

Then you would want a union of the struct type and an appropriately
sized unsigned char array, and initialise the unsigned char area with
the bytes of the file using #embed.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Kaz Kylheku@21:1/5 to Keith Thompson on Thu Feb 29 19:26:42 2024

On 2024-02-29, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:

Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:

On 29.02.2024 17:18, Keith Thompson wrote:

"abc\0def" is a valid string literal, but its value is not a string.
(No, the standard doesn't say that the value of a string literal is a
string.)

This sounds somewhat strange in my ears. Usually a literal for a type
will constitute an instance of the type. - I suppose the irregularity
stems from the fact that there's no explicit string object type in C.

Exactly, "string" is not a type.

It is a type in the broader sense, in that it is a logical proposition
about the attributes of an object that is true or false.

It's just not a type in the C static type system.

What that means is that there does not exist a constraint rule in
standard C requiring some expression or object to conform to the string
type. The concept "string" is not represented in the constraint system.
But it is a type concept.

(There are rules that require a string, but they are not constraint
rules. E.g. if strlen is given an argument which isn't a string, the
behavior is undefined.)

Consider:

char a[3] = "abc";
size_t l = strlen(a);

In the unlikely event that this example would capture the attention of a computer scientist who researches type systems, he or she would identify
that as having a type error. (One that the C type system is too weak to
model.)

"Upper case letter" is also a type; that's why the header is called
<ctype.h>.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to David Brown on Thu Feb 29 21:05:02 2024

On Thu, 29 Feb 2024 08:58:40 +0100, David Brown wrote:

On 28/02/2024 21:56, Lawrence D'Oliveiro wrote:

On Wed, 28 Feb 2024 12:50:10 +0100, David Brown wrote:

... people write utilities for them in a variety of languages ...

But it will often be more convenient to have it built into the
language and compiler.

What can be built into the language can only ever be a small subset of
the many and varied ways that people have incorporated data blobs into
their programs.

Of course. But that doesn't mean that a language should not include a feature that makes it easy for a lot of people to get some data blobs
into their code.

Maybe the C compiler should concentrate on compiling C code, and leave it
to the rest of the build toolchain to deal with other data.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to Scott Lurndal on Thu Feb 29 21:27:52 2024

On Thu, 29 Feb 2024 18:09:52 GMT, Scott Lurndal wrote:

I have an actual use case today where #embed of a (C++) std::map binary object created by separate tool would be very useful. I'm planning on
using mmap to load it at runtime at the moment.

Why not convert it to a .o file and statically link it into your program
as part of the build process?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From bart@21:1/5 to Keith Thompson on Thu Feb 29 21:44:20 2024

On 29/02/2024 21:20, Keith Thompson wrote:

scott@slp53.sl.home (Scott Lurndal) writes:

Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:

On 29.02.2024 17:17, Scott Lurndal wrote:

Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:

On 29.02.2024 16:48, Scott Lurndal wrote:

int main(int b,char**i){long long n=B,a=I^n,r=(a/b&a)>>4,y=atoi(*++i),_=(((a^n/b)*(y>>T)|y>>S)&r)|(a^r);printf("%.8s\n",(char*)&_);}

What does it do?

What preconditions must be fulfilled or what additions
does it need to compile?

With the link below I see it "needs" a 600+ lines long Makefile.

The readme simply says compile it and run it
as ./prog <value between 1 and 512>.

No, you have to compile it with specific command-line arguments to
define B and I. The Makefile does that (don't ask me why it's so long),
but you can do it manually.

From hint.txt:
"""
On a little-endian machine:

clang -include stdio.h -include stdlib.h -Wall -Weverything -pedantic -DB=6945503773712347754LL -DI=5859838231191962459LL -DT=0 -DS=7 -o prog prog.c

On a big-endian machine:

clang -include stdio.h -include stdlib.h -Wall -Weverything -pedantic -DB=7091606191627001958LL -DI=6006468689561538903LL -DT=1 -DS=0 -o prog.be prog.c
"""

Isn't it cheating when half the program is part of the build
instructions? Here is a complete standalone program:

----------------
#include <stdio.h>
#include <stdlib.h>
#define B 6945503773712347754LL
#define I 5859838231191962459LL
#define T 0
#define S 7
int main(int b,char**i){long long n=B,a=I^n,r=(a/b&a)>>4,y=atoi(*++i),_=(((a^n/b)*(y>>T)|y>>S)&r)|(a^r);printf("%.8s\n",(char*)&_);}
----------------

It's 261 bytes. The 'one-liner' that was posted was 134 bytes.

(It appears to print an input of 0 to 255 as binary.)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to David Brown on Thu Feb 29 21:36:05 2024

On Thu, 29 Feb 2024 16:19:45 +0100, David Brown wrote:

An array of bytes is not a "string".

It is in PHP, I think also in Perl, and also in (obsolete) Python 2.

And what about C string functions that take explicit lengths?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Lawrence D'Oliveiro on Fri Mar 1 09:16:00 2024

On 29/02/2024 22:05, Lawrence D'Oliveiro wrote:

On Thu, 29 Feb 2024 08:58:40 +0100, David Brown wrote:

On 28/02/2024 21:56, Lawrence D'Oliveiro wrote:

On Wed, 28 Feb 2024 12:50:10 +0100, David Brown wrote:

... people write utilities for them in a variety of languages ...

But it will often be more convenient to have it built into the
language and compiler.

What can be built into the language can only ever be a small subset of
the many and varied ways that people have incorporated data blobs into
their programs.

Of course. But that doesn't mean that a language should not include a
feature that makes it easy for a lot of people to get some data blobs
into their code.

Maybe the C compiler should concentrate on compiling C code, and leave it
to the rest of the build toolchain to deal with other data.

It is possible to be actively involved in the development of the
standards - preparing and discussing proposals, joining committees, or
at least joining mailing lists for the discussions. If you are not
doing the work and showing the interest /before/ decisions are made, you
don't get a say afterwards. It is more productive to discuss what you
can do with the features C has, than to wish it never had them.

Oh, and the reason C23 has #embed, is because people want it. It is
something C developers have asked for for many years. /You/ might not
have use of it, but that's true of lots of features of C for all
programmers - no one needs everything in the language and standard library.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From bart@21:1/5 to Lawrence D'Oliveiro on Fri Mar 1 11:52:16 2024

On 29/02/2024 21:27, Lawrence D'Oliveiro wrote:

On Thu, 29 Feb 2024 18:09:52 GMT, Scott Lurndal wrote:

I have an actual use case today where #embed of a (C++) std::map binary
object created by separate tool would be very useful. I'm planning on
using mmap to load it at runtime at the moment.

Why not convert it to a .o file and statically link it into your program
as part of the build process?

That's exactly what #embed will enable.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From bart@21:1/5 to tTh on Fri Mar 1 11:58:49 2024

On 29/02/2024 15:34, tTh wrote:

On 2/29/24 11:18, bart wrote:

Using 'strinclude' in my old C compiler, it took about 1 second to
build this program:

   #include <stdio.h>
   #include <string.h>

   char* s=strinclude("data");

   int main(void) {
      printf("%zu\n", strlen(s));
  }

tth@redlady:~/Desktop$ man strinclude
No manual entry for strinclude
tth@redlady:~/Desktop$

'strinclude' is an extension I made for that compiler.

#embed is the new feature of C23. Although I'm not sure how it would be
used to initialise a char* pointer. Perhaps like this:

char dummy[] = {
#embed "data"
,0};
char* s = dummy;

(I've added a 0-terminator here; I don't know if #embed will take care
of that.)

My 'strinclude' produces a zero-terminated string, but it is done within
the parser rather than lexer.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Richard Harnden@21:1/5 to Lawrence D'Oliveiro on Fri Mar 1 12:59:43 2024

On 29/02/2024 21:36, Lawrence D'Oliveiro wrote:

On Thu, 29 Feb 2024 16:19:45 +0100, David Brown wrote:

An array of bytes is not a "string".

It is in PHP, I think also in Perl, and also in (obsolete) Python 2.

And what about C string functions that take explicit lengths?

You mean: There's a danger that a function that returns a 'string', but truncates it to n chars, might not be returning a string at all?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to bart on Fri Mar 1 13:17:01 2024

On 01/03/2024 12:58, bart wrote:

On 29/02/2024 15:34, tTh wrote:

On 2/29/24 11:18, bart wrote:

Using 'strinclude' in my old C compiler, it took about 1 second to
build this program:

   #include <stdio.h>
   #include <string.h>

   char* s=strinclude("data");

   int main(void) {
      printf("%zu\n", strlen(s));
  }

tth@redlady:~/Desktop$ man strinclude
No manual entry for strinclude
tth@redlady:~/Desktop$

'strinclude' is an extension I made for that compiler.

#embed is the new feature of C23. Although I'm not sure how it would be
used to initialise a char* pointer. Perhaps like this:

    char dummy[] = {
    #embed "data"
    ,0};
    char* s = dummy;

(I've added a 0-terminator here; I don't know if #embed will take care
of that.)

#embed very specifically does not add anything. So you would do :

const char s[] = {
#embed "data" suffix(,)
0
};

The "suffix" parameter adds a comma if "data" is not empty, and does
nothing if "data" is empty. Writing it as you did would work fine for non-empty "data" but give the nonsensical result {,0} if "data" is
empty. (You might not care about such cases and prefer to write the
simpler version, but now you also know about "suffix".)

There is no need to have a separate character pointer variable - the
const char array can be used directly in most circumstances.

My 'strinclude' produces a zero-terminated string, but it is done within
the parser rather than lexer.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Kaz Kylheku@21:1/5 to David Brown on Fri Mar 1 16:55:51 2024

On 2024-03-01, David Brown <david.brown@hesbynett.no> wrote:

It is possible to be actively involved in the development of the
standards - preparing and discussing proposals, joining committees, or
at least joining mailing lists for the discussions. If you are not
doing the work and showing the interest /before/ decisions are made, you don't get a say afterwards. It is more productive to discuss what you
can do with the features C has, than to wish it never had them.

Also, if you don't join the gang that breaks windows and spray
paints walls, you don't get to say afterward which windows are broken
and what is scribbled on what wall.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Janis Papanagnou@21:1/5 to Keith Thompson on Fri Mar 1 18:09:05 2024

On 29.02.2024 23:06, Keith Thompson wrote:

bart <bc@freeuk.com> writes:
[...]

Isn't it cheating when half the program is part of the build
instructions?

I recall from decades ago (when I looked into this contest) that they
even had a contribution that fed the whole C program into the compiler
through compiler options. (I think it even got a prize.)

"Is it cheating?" - I'd say no, since it was accepted.

Is it really about an "obfuscated C code"? - I'd say no. (But it was
anyway a curiosity.)

Apparently not. If it were, the judges of the IOCCC would not have
accepted it.
[...]

One of the winners of the 1988 contest was:
    #include "/dev/tty"

This is great! :-)

Janis

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Kaz Kylheku on Fri Mar 1 18:28:05 2024

On 01/03/2024 17:55, Kaz Kylheku wrote:

On 2024-03-01, David Brown <david.brown@hesbynett.no> wrote:

It is possible to be actively involved in the development of the
standards - preparing and discussing proposals, joining committees, or
at least joining mailing lists for the discussions. If you are not
doing the work and showing the interest /before/ decisions are made, you
don't get a say afterwards. It is more productive to discuss what you
can do with the features C has, than to wish it never had them.

Also, if you don't join the gang that breaks windows and spray
paints walls, you don't get to say afterward which windows are broken
and what is scribbled on what wall.

A slightly closer version of that feeble analogy would be that you don't
get to say they should have used a different colour, or broken doors
instead of windows.

It's okay for Lawrence (or anyone else) to say that they don't approve of
#embed, or don't think they will use it themselves. But like most
(probably all) features in newer C standards, it was added because
enough people wanted it for the committee and connected developers to do
the work designing and documenting the features, and testing prototypes
in practice.

There are procedures in place for people to have an influence on the
future of C. If you want to have your say, you can have it. But
waiting until a new standard version is solidified and then complaining
that you don't like the direction it is taking, is too late. Whining
about things here afterwards doesn't do anyone any good.

That's different from saying you don't like the feature, or you don't
like the way C is heading, or you won't use it yourself. And it's
different from talking about it, trying to learn how a new feature works
and how to make the best of it. Such discussions are great, and I'd
love to see more of them here in c.l.c.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to Richard Harnden on Fri Mar 1 20:59:11 2024

On Fri, 1 Mar 2024 12:59:43 +0000, Richard Harnden wrote:

You mean: There's a danger that a function that returns a 'string', but
truncates it to n chars, might not be returning a string at all?

If it’s not NUL-terminated, then it’s not a “string”, right?
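
A minimal sketch of that danger, using strncpy as the familiar example of a function that truncates without terminating. The helpers holds_string and copy_truncated are hypothetical names made up for this illustration, not library functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if buf[0..n-1] contains a null character, i.e. it holds a C string. */
static bool holds_string(const char *buf, size_t n)
{
    return memchr(buf, '\0', n) != NULL;
}

/* Hypothetical helper: copy at most n-1 characters and always terminate. */
static void copy_truncated(char *dst, size_t n, const char *src)
{
    if (n == 0)
        return;
    size_t len = strlen(src);
    if (len > n - 1)
        len = n - 1;
    memcpy(dst, src, len);
    dst[len] = '\0';
}

static void demo_truncation(void)
{
    char buf[4];

    /* The source is longer than the buffer: strncpy fills all 4 bytes
       and writes no terminator, so buf does not hold a string. */
    strncpy(buf, "abcdef", sizeof buf);
    assert(!holds_string(buf, sizeof buf));

    /* The terminating variant truncates to "abc" and stays a string. */
    copy_truncated(buf, sizeof buf, "abcdef");
    assert(holds_string(buf, sizeof buf));
    assert(strcmp(buf, "abc") == 0);
}
```

Some compilers warn about the strncpy call here, which is rather the point: the result must not be passed to anything that expects a string.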

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Janis Papanagnou@21:1/5 to Keith Thompson on Fri Mar 1 22:06:03 2024

On 01.03.2024 19:49, Keith Thompson wrote:

Like most Abuse of the Rules winners, it resulted in a rule change for
the following years.

Makes sense.

Janis

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to bart on Tue Mar 5 04:47:18 2024

On Fri, 1 Mar 2024 11:52:16 +0000, bart wrote:

On 29/02/2024 21:27, Lawrence D'Oliveiro wrote:

On Thu, 29 Feb 2024 18:09:52 GMT, Scott Lurndal wrote:

I have an actual use case today where #embed of a (C++) std::map
binary object created by separate tool would be very useful. I'm
planning on using mmap to load it at runtime at the moment.

Why not convert it to a .o file and statically link it into your
program as part of the build process?

That's exactly what #embed will enable.

You can call it a toy version of objcopy <https://manpages.debian.org/1/objcopy.1.html>.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to Keith Thompson on Tue Mar 5 04:48:38 2024

On Thu, 29 Feb 2024 14:14:52 -0800, Keith Thompson wrote:

"A *string* is a contiguous sequence of characters
terminated by and including the first null character."

So how come strlen(3) does not include the null?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to Lawrence D'Oliveiro on Tue Mar 5 15:09:06 2024

Lawrence D'Oliveiro <ldo@nz.invalid> writes:

On Fri, 1 Mar 2024 11:52:16 +0000, bart wrote:

On 29/02/2024 21:27, Lawrence D'Oliveiro wrote:

On Thu, 29 Feb 2024 18:09:52 GMT, Scott Lurndal wrote:

I have an actual use case today where #embed of a (C++) std::map
binary object created by separate tool would be very useful. I'm
planning on using mmap to load it at runtime at the moment.

Why not convert it to a .o file and statically link it into your
program as part of the build process?

That's exactly what #embed will enable.

You can call it a toy version of objcopy <https://manpages.debian.org/1/objcopy.1.html>.

While objcopy supports a number of ways to
manipulate an ELF file, I wouldn't equate it
with #embed at all.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to Scott Lurndal on Wed Mar 6 01:49:36 2024

On Tue, 05 Mar 2024 15:09:06 GMT, Scott Lurndal wrote:

Lawrence D'Oliveiro <ldo@nz.invalid> writes:

On Fri, 1 Mar 2024 11:52:16 +0000, bart wrote:

On 29/02/2024 21:27, Lawrence D'Oliveiro wrote:

On Thu, 29 Feb 2024 18:09:52 GMT, Scott Lurndal wrote:

I have an actual use case today where #embed of a (C++) std::map
binary object created by separate tool would be very useful. I'm
planning on using mmap to load it at runtime at the moment.

Why not convert it to a .o file and statically link it into your
program as part of the build process?

That's exactly what #embed will enable.

You can call it a toy version of objcopy <https://manpages.debian.org/1/objcopy.1.html>.

While objcopy supports a number of ways to manipulate an ELF file, I
wouldn't equate it with #embed at all.

It does a whole lot more.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to Keith Thompson on Thu Mar 7 21:08:48 2024

On Mon, 04 Mar 2024 20:55:28 -0800, Keith Thompson wrote:

Lawrence D'Oliveiro <ldo@nz.invalid> writes:

On Thu, 29 Feb 2024 14:14:52 -0800, Keith Thompson wrote:

"A *string* is a contiguous sequence of characters terminated by and
including the first null character."

So how come strlen(3) does not include the null?

Because the *length of a string* is by definition "the number of bytes
preceding the null character".

So the “string” itself includes the null character, but its “length” does
not?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Kaz Kylheku@21:1/5 to Lawrence D'Oliveiro on Thu Mar 7 21:44:06 2024

On 2024-03-07, Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

On Mon, 04 Mar 2024 20:55:28 -0800, Keith Thompson wrote:

Lawrence D'Oliveiro <ldo@nz.invalid> writes:

On Thu, 29 Feb 2024 14:14:52 -0800, Keith Thompson wrote:

"A *string* is a contiguous sequence of characters terminated by and
including the first null character."

So how come strlen(3) does not include the null?

Because the *length of a string* is by definition "the number of bytes
preceding the null character".

So the “string” itself includes the null character, but its “length” does
not?

That's correct. However, its size includes it.

sizeof "abc" == 4

strlen("abc") == 3

The abstract string does not include the null character;
we understand "abc" to be a three character string.

The C representation of the string includes the null character;
the size is a representational concept so it counts it.

It is common for C programs to break encapsulation and openly deal with
that terminating null.
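
The size/length distinction above can be checked directly. This is only a small sketch; check_abc is a name made up for the illustration.

```c
#include <assert.h>
#include <string.h>

/* Verifies that the size counts the terminating null and the length does not. */
static void check_abc(void)
{
    static const char abc[] = "abc";

    assert(sizeof "abc" == 4);   /* the literal is an array of 4 chars */
    assert(sizeof abc == 4);     /* so is an array initialised from it */
    assert(strlen(abc) == 3);    /* length stops before the null */
    assert(abc[3] == '\0');      /* the null is stored as the fourth element */
}
```

Note that sizeof only gives this answer for an array; applied to a char pointer it yields the size of the pointer instead.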

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Kaz Kylheku@21:1/5 to Keith Thompson on Thu Mar 7 23:00:20 2024

On 2024-03-07, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:

Kaz Kylheku <433-929-6894@kylheku.com> writes:

On 2024-03-07, Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

On Mon, 04 Mar 2024 20:55:28 -0800, Keith Thompson wrote:

Lawrence D'Oliveiro <ldo@nz.invalid> writes:

On Thu, 29 Feb 2024 14:14:52 -0800, Keith Thompson wrote:

"A *string* is a contiguous sequence of characters terminated by and
including the first null character."

So how come strlen(3) does not include the null?

Because the *length of a string* is by definition "the number of bytes
preceding the null character".

So the “string” itself includes the null character, but its “length” does
not?

That's correct. However, its size includes it.

sizeof "abc" == 4

strlen("abc") == 3

The abstract string does not include the null character;
we understand "abc" to be a three character string.

Sure, if you define "abstract string" that way. I'll just note that C's
definition of the word "string" does include the terminating null
character, and does not talk about "abstract strings". (A string in the
abstract machine clearly includes the null character, but that's a bit
of a stretch.)

Yes; "abstract machine" is not what I mean by abstract.

The concept of the abstract string lives in the semantics though.

When N strings are catenated together, their abstract strings are
juxtaposed together without any nulls in between, with only a single
null at the end.

Furthermore, when a string is sent to a stream with %s or {f}puts,
the null byte is omitted, like in the calculation of length.

Clearly, there is a semantics that the part before the null byte
is the text processing payload; what I'm calling the abstract string.
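
As a rough sketch of the catenation semantics described above: join2 is a hypothetical helper written for this illustration, not a library function. It copies the payload of the first string without its null, and only the second string's null ends the result.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: write a followed by b into out (capacity cap).
   Returns the length of the result, or (size_t)-1 if it does not fit. */
static size_t join2(char *out, size_t cap, const char *a, const char *b)
{
    size_t la = strlen(a);
    size_t lb = strlen(b);

    if (la + lb + 1 > cap)
        return (size_t)-1;
    memcpy(out, a, la);           /* payload of a; its null is not copied */
    memcpy(out + la, b, lb + 1);  /* payload of b plus the single final null */
    return la + lb;
}

static void demo_join(void)
{
    char buf[16];

    assert(join2(buf, sizeof buf, "abc", "def") == 6);
    assert(strcmp(buf, "abcdef") == 0);
    /* The first null in the result sits at index 6, with none in between. */
    assert(memchr(buf, '\0', sizeof buf) == buf + 6);
}
```

The lengths add up exactly (3 + 3 == 6), while the sizes do not (4 + 4 != 7), which is one way to see that the payload is what gets juxtaposed.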

(With character encodings, it gets hairy. The part before the null
may be a UTF-8 sequence, where the abstract string consists of code
points. Which may be combining characters, so the True Scotsman's
abstract string is the sequence of characters.)
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Richard Harnden@21:1/5 to Keith Thompson on Fri Mar 8 00:26:04 2024

On 07/03/2024 22:25, Keith Thompson wrote:

Kaz Kylheku <433-929-6894@kylheku.com> writes:

On 2024-03-07, Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

On Mon, 04 Mar 2024 20:55:28 -0800, Keith Thompson wrote:

Lawrence D'Oliveiro <ldo@nz.invalid> writes:

On Thu, 29 Feb 2024 14:14:52 -0800, Keith Thompson wrote:

"A *string* is a contiguous sequence of characters terminated by and
including the first null character."

So how come strlen(3) does not include the null?

Because the *length of a string* is by definition "the number of bytes
preceding the null character".

So the “string” itself includes the null character, but its “length” does
not?

That's correct. However, its size includes it.

sizeof "abc" == 4

strlen("abc") == 3

The abstract string does not include the null character;
we understand "abc" to be a three character string.

Sure, if you define "abstract string" that way. I'll just note that C's
definition of the word "string" does include the terminating null
character, and does not talk about "abstract strings". (A string in the
abstract machine clearly includes the null character, but that's a bit
of a stretch.)

A string is just a data format.

You have a string of chars, terminated by a '\0'.

You can have a "string" of anything, terminated by a NULL.
Everyone's used to argv, for example.

Yes, I'm being annoyingly pedantic.
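
For what it's worth, a sketch of the same format applied to pointers, in the style of argv. vec_length is a made-up name for the pointer analogue of strlen; it counts entries before the terminating NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Counts the entries before the terminating NULL, as strlen does for chars. */
static size_t vec_length(char *const *vec)
{
    size_t n = 0;

    while (vec[n] != NULL)
        n++;
    return n;
}

static void demo_vec(void)
{
    /* An argv-style vector: three entries followed by the NULL terminator. */
    char *args[] = { "prog", "-v", "file.txt", NULL };

    assert(vec_length(args) == 3);                  /* "length" excludes the NULL */
    assert(sizeof args / sizeof args[0] == 4);      /* "size" includes it */
}
```

The same size-versus-length split shows up here: the array has four elements, but the vector it represents has three.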

The C representation of the string includes the null character;
the size is a representational concept so it counts it.

It is common for C programs to break encapsulation and openly deal with
that terminating null.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)
