Forum: >>> Magnum BBS <<<

Storing strings

From James Harris@21:1/5 to All on Mon Oct 24 11:31:11 2022

Do you guys have any thoughts on the best ways for strings of characters
to be stored?

1. There's the C way, of course, of reserving one value (zero) and using
it as a terminator.

2. There's the 'length prefix' option of putting the length of the
string in a machine word before the characters.

3. There's the 'double pointer' way of pointing at, say, first and past
(where 'past' means first plus length such that the second pointer
points one position beyond the last character).

Any others?

Options 1 and 2 have the advantage that they can be referred to simply
by address. Option 3 needs an additional place in which to store the
(first, past) control block.

Option 1 has the advantage that it's easy for a program to process (by
either pointer or index).

Options 1 and 3 have the advantage that one can refer to the tail of the
string (anything past the first character) without creating a copy,
although option 3 would need a new control block to be created. Option 2
would require a new string to be created.

In fact, option 3 has the advantage that it allows any continuous
substring - head, mid, or tail - to be referred to without making a copy
of the required part of the string.

Options 2 and 3 make it fast to find the length. They also allow any
value (i.e. including zero) to be part of the string.

So: Which of those should a compiler support? Should it support more
than one form? If so, should the language allow the programmer to
specify which form to use on any particular string?

If that's not complicated enough, the above essentially considers
strings whose contents could be read-only or read-write but their
lengths don't change. If the lengths can change then there are
additional issues of storage management. Eek! ;)

Recommendations welcome!

--
James Harris

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stefan Ram@21:1/5 to James Harris on Mon Oct 24 11:44:38 2022

James Harris <james.harris.1@gmail.com> writes:

So: Which of those should a compiler support? Should it support more
than one form? If so, should the language allow the programmer to
specify which form to use on any particular string?

I think the idea of C is to leave it up to the programmer.
The C string literals and functions are just some kind of
suggestion, and they help to provide basic services, such
as printing some text to the terminal. But otherwise, the
programmer is free to implement his own string type(s) or
use string libraries.

The choice depends on the expected type of use. For example,
some ways to store strings are known as "ropes" (Hans J Boehm,
1994), others are known as "gap buffers". A text editor
might simultaneously use ropes for its text buffers and
C strings for filenames.

The crucial thing for allowing programmers to implement
their own string type is that the languages is fast enough
to do this with little overhead compared to an implementation
of strings in the langugage itself. Implementing custom
string representations in slow languages might not feasible.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Dmitry A. Kazakov@21:1/5 to James Harris on Mon Oct 24 14:07:32 2022

On 2022-10-24 12:31, James Harris wrote:

Do you guys have any thoughts on the best ways for strings of characters
to be stored?

1. There's the C way, of course, of reserving one value (zero) and using
it as a terminator.

2. There's the 'length prefix' option of putting the length of the
string in a machine word before the characters.

3. There's the 'double pointer' way of pointing at, say, first and past (where 'past' means first plus length such that the second pointer
points one position beyond the last character).

Any others?

4. String body only. The constraints are known outside.

This is the way string slices and fixed length strings are implemented.
In the later case the compiler knows the strings bounds (first and last
indices and thus the length). In the former case the compiler passes a
"string dope" along with the naked body. The dope contains the bounds.

This has an effect on pointers. E.g. if you want slices and efficient
raw strings you must distinguish pointers to definite (constrained) vs. indefinite (unconstrained) objects of same type.

E.g. in Ada you cannot take an indefinite string pointer to a fixed
length string because there is no bounds. If you wanted that feature you
would use a "fat pointer" to carry bounds with it.

This is similar to atomic, volatile objects and pointers to. The
mechanics is same. You cannot take a general-purpose pointer to an
atomic object, because the client code would not know that it should
take care upon dereferencing.

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Bart@21:1/5 to James Harris on Mon Oct 24 15:28:44 2022

On 24/10/2022 11:31, James Harris wrote:

Do you guys have any thoughts on the best ways for strings of characters
to be stored?

1. There's the C way, of course, of reserving one value (zero) and using
it as a terminator.

2. There's the 'length prefix' option of putting the length of the
string in a machine word before the characters.

3. There's the 'double pointer' way of pointing at, say, first and past (where 'past' means first plus length such that the second pointer
points one position beyond the last character).

Any others?

Options 1 and 2 have the advantage that they can be referred to simply
by address. Option 3 needs an additional place in which to store the
(first, past) control block.

Option 1 has the advantage that it's easy for a program to process (by
either pointer or index).

Options 1 and 3 have the advantage that one can refer to the tail of the string (anything past the first character) without creating a copy,
although option 3 would need a new control block to be created. Option 2 would require a new string to be created.

In fact, option 3 has the advantage that it allows any continuous
substring - head, mid, or tail - to be referred to without making a copy
of the required part of the string.

Options 2 and 3 make it fast to find the length. They also allow any
value (i.e. including zero) to be part of the string.

So: Which of those should a compiler support? Should it support more
than one form? If so, should the language allow the programmer to
specify which form to use on any particular string?

If that's not complicated enough, the above essentially considers
strings whose contents could be read-only or read-write but their
lengths don't change. If the lengths can change then there are
additional issues of storage management. Eek! ;)

For lower level strings, I'd highly recommend using zero-terminated
strings, or using them as the basis, or at least having it as an option.

This is not the 'C way', as I'd long used this outside of C and Unix
(eg. in DEC assembly, and in my own stuff for at least a decode before I
first dealt with C.

I still use them, and among many advantages such as pure simplicity,
allow you to directly make use of innumerable APIs that specify such
strings.

They can be used in contexts such as the compact string fields of
structs, since the only overhead is allowing space for that terminator **.

The next step up, in lower level code, is to use a slice. This is a
(pointer, length) descriptor. Here no terminator is necessary, and
allows strings to also contain embedded zeros (so can contain any binary
data).

String slices can point into another string (allowing sharing), or into
another slice, or into a regular zero-terminated string.

However to call an API function expecting a zero-terminated string
('stringz` as I sometimes call it), the pointer is not enough: you need
to ensure there's a zero following those <length> characters!

Within my dynamic scripting language, I have a full-on counted string
type, with reference counting to manage sharing and allow automatic
memory management. But with the same headache when calling low-level FFI functions that expect C-like strings.

But that language at least will cope with it.

(** The scripting language can define structs with fixed types including fixed-width string fields. Those are defined in two ways:

stringz*8 A
stringc*8 B

Both A and B occupy an 8-byte field. But A can store a maximum string of
7 characters, with B it can be 8 characters.

Yet B also includes the count so no scanning is needed to determine the
string length. The scheme however only works on fields of 2 to 256
characters.)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From James Harris@21:1/5 to Dmitry A. Kazakov on Wed Oct 26 20:43:11 2022

On 24/10/2022 13:07, Dmitry A. Kazakov wrote:

On 2022-10-24 12:31, James Harris wrote:

Do you guys have any thoughts on the best ways for strings of
characters to be stored?

1. There's the C way, of course, of reserving one value (zero) and
using it as a terminator.

2. There's the 'length prefix' option of putting the length of the
string in a machine word before the characters.

3. There's the 'double pointer' way of pointing at, say, first and
past (where 'past' means first plus length such that the second
pointer points one position beyond the last character).

Any others?

4. String body only. The constraints are known outside.

This is the way string slices and fixed length strings are implemented.
In the later case the compiler knows the strings bounds (first and last indices and thus the length). In the former case the compiler passes a "string dope" along with the naked body. The dope contains the bounds.

That doesn't seem meaningfully different from case 3. To be clear, case
3 would be represented by, in addition to the bytes of the string,

struct
first: pointer to first byte of string
past: pointer to byte after last byte of string
.... other fields ....
end struct

The string length would be past - first. The bytes of the string would
be those pointed at (which I presume is what you are calling the naked
body).

This has an effect on pointers. E.g. if you want slices and efficient
raw strings you must distinguish pointers to definite (constrained) vs. indefinite (unconstrained) objects of same type.

E.g. in Ada you cannot take an indefinite string pointer to a fixed
length string because there is no bounds. If you wanted that feature you would use a "fat pointer" to carry bounds with it.

Any reason you'd recommend against storing bounds as in the struct, above?

This is similar to atomic, volatile objects and pointers to. The
mechanics is same. You cannot take a general-purpose pointer to an
atomic object, because the client code would not know that it should
take care upon dereferencing.

I am not sure what that means. I guess the point you are making is that
there are levels of classification which don't affect the data type but
they do affect how it can be accessed - with the language needing to
prevent a reference weakening the storage model. For example, a
read-write reference to a substring should be prevented from being used
to access part of a string which is supposed to be read-only.

--
James Harris

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From James Harris@21:1/5 to Stefan Ram on Wed Oct 26 21:52:06 2022

On 24/10/2022 12:44, Stefan Ram wrote:

James Harris <james.harris.1@gmail.com> writes:

So: Which of those should a compiler support? Should it support more
than one form? If so, should the language allow the programmer to
specify which form to use on any particular string?

I think the idea of C is to leave it up to the programmer.
The C string literals and functions are just some kind of
suggestion, and they help to provide basic services, such
as printing some text to the terminal. But otherwise, the
programmer is free to implement his own string type(s) or
use string libraries.

The choice depends on the expected type of use. For example,
some ways to store strings are known as "ropes" (Hans J Boehm,
1994), others are known as "gap buffers". A text editor
might simultaneously use ropes for its text buffers and
C strings for filenames.

Thanks for the references.

The crucial thing for allowing programmers to implement
their own string type is that the languages is fast enough
to do this with little overhead compared to an implementation
of strings in the langugage itself. Implementing custom
string representations in slow languages might not feasible.

That's fair. I would like, however, to have an inbuilt string type that
is easy to work with so that there's a pre-made standard and programmers
don't have to come up with their own or to spend time working out what a previous programmer had created.

I should have suggested a string or slice interface. Here's a first
attempt at the operations a string would be expected to be hit with.

These are part of the mechanics of string handling, relating to the
structure of the string rather than to its contents, so I've not
included anything in this list which looks at the content of the string.
There are loads of content-based operations such as string comparisons,
case conversion, whitespace trimming, etc, which could be built on top
of the basic handling.

Potential operations on string structures:
* allocate a new string
* create a slice (view) of an existing string
* index into a string
* increase the size of a string
* reduce the size of a string
* return the length of the string
* append/delete characters from the end
* insert/delete characters at the beginning
* take a slice of a string
* concatenate strings (including copying)
* pass to and from functions

The idea of slices is that they would appear the be strings but could be created to refer to the same string elements without allocating new
storage for the sliced data.

These are just some ideas on what might be required. To do this
comprehensively seems rather complicated! :(

--
James Harris

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From James Harris@21:1/5 to Bart on Wed Oct 26 21:33:59 2022

On 24/10/2022 15:28, Bart wrote:

On 24/10/2022 11:31, James Harris wrote:

Do you guys have any thoughts on the best ways for strings of
characters to be stored?

..

For lower level strings, I'd highly recommend using zero-terminated
strings, or using them as the basis, or at least having it as an option.

They certainly seem easiest to work with although they do have
limitations such as:

* cannot include a character with the encoding of zero (as you say)
* must be scanned to determine length
* awkward to add to or delete from the end of as they don't carry any
data about whether the memory immediately following is available or not

This is not the 'C way', as I'd long used this outside of C and Unix
(eg. in DEC assembly, and in my own stuff for at least a decode before I first dealt with C.

True, though C popularised the scheme. Besides, on the PDP one way of
storing strings was apparently as

(length, address)

That's according to the Commercial Instruction Set (CIS) part of

https://en.wikipedia.org/wiki/PDP-11_architecture

I still use them, and among many advantages such as pure simplicity,
allow you to directly make use of innumerable APIs that specify such
strings.

They can be used in contexts such as the compact string fields of
structs, since the only overhead is allowing space for that terminator **.

OK.

The next step up, in lower level code, is to use a slice. This is a
(pointer, length) descriptor. Here no terminator is necessary, and
allows strings to also contain embedded zeros (so can contain any binary data).

String slices can point into another string (allowing sharing), or into another slice, or into a regular zero-terminated string.

That's more universal and therefore perhaps the best to implement if
only one scheme is to be available. Have to say, though, I guess it
would be hard to manage the memory for. Instead of just (first, length)
or (first, past) perhaps one would need something like

struct
first: pointer to first element
past: pointer just past last element
count: number of slices pointing to this slice/string
base: the parent string or memory
flags: various
end struct

The base field would refer to the string object we were a slice of or,
if we were not a slice but the base string, the memory area in which the
string was stored.

The flags would indicate whether the string/slice could have its
contents changed and whether it could have its length changed, whether
the contents could be moved in memory, etc.

However to call an API function expecting a zero-terminated string
('stringz` as I sometimes call it), the pointer is not enough: you need
to ensure there's a zero following those <length> characters!

Within my dynamic scripting language, I have a full-on counted string
type, with reference counting to manage sharing and allow automatic
memory management.

What fields did you use to manage such stuff? Am I on the right lines
with the ideas above?

But with the same headache when calling low-level FFI
functions that expect C-like strings.

Just a thought: ensure there is always at least one more byte of memory
than the string requires and put a zero byte at the end of the string
before calling any function which expects a C-like string. (User
responsibility to ensure there are no zero bytes embedded in the string.)

Perhaps one reason is that some predefined data structures include a fixed-length field in which a string can sit but which has no room for
another byte such as a terminating zero. But for them the string could
be copied out.

Having a string defined by a (first, past) pair would perhaps allow fixed-length fields to be handled as easily as mutable strings.

--
James Harris

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Dmitry A. Kazakov@21:1/5 to James Harris on Thu Oct 27 09:28:51 2022

On 2022-10-26 21:43, James Harris wrote:

On 24/10/2022 13:07, Dmitry A. Kazakov wrote:

On 2022-10-24 12:31, James Harris wrote:

Do you guys have any thoughts on the best ways for strings of
characters to be stored?

1. There's the C way, of course, of reserving one value (zero) and
using it as a terminator.

2. There's the 'length prefix' option of putting the length of the
string in a machine word before the characters.

3. There's the 'double pointer' way of pointing at, say, first and
past (where 'past' means first plus length such that the second
pointer points one position beyond the last character).

Any others?

4. String body only. The constraints are known outside.

This is the way string slices and fixed length strings are
implemented. In the later case the compiler knows the strings bounds
(first and last indices and thus the length). In the former case the
compiler passes a "string dope" along with the naked body. The dope
contains the bounds.

That doesn't seem meaningfully different from case 3. To be clear, case
3 would be represented by, in addition to the bytes of the string,

struct
first: pointer to first byte of string
past: pointer to byte after last byte of string
.... other fields ....
end struct

The string length would be past - first. The bytes of the string would
be those pointed at (which I presume is what you are calling the naked
body).

That is the structure of a string dope, not the string itself, unless
you have the body in other fields, but then why would you need pointers?

To clarify terms. String representation must include the string body if
we are talking about values of strings. The things like pointers and
vectorized dopes are references to a string, not strings. You can pass a
string by a reference, sure. But the string value is somewhere else.
What you pass is not a string it is a substitute.

This has an effect on pointers. E.g. if you want slices and efficient
raw strings you must distinguish pointers to definite (constrained)
vs. indefinite (unconstrained) objects of same type.

E.g. in Ada you cannot take an indefinite string pointer to a fixed
length string because there is no bounds. If you wanted that feature
you would use a "fat pointer" to carry bounds with it.

Any reason you'd recommend against storing bounds as in the struct, above?

Start with interoperability of strings and slices of. The crucial
requirements would be:

A slice can be passed to a subprogram expecting a string without
copying.

Consider efficiency and low-level close to hardware stuff:

Aggregation of strings with known bounds does not require storing them.

E.g. you can have arrays of fixed length strings (like an image buffer).
If a member of a structure is a fixed length string, no bounds are
stored. A pointer to a fixed length string is a plain pointer etc.

This is similar to atomic, volatile objects and pointers to. The
mechanics is same. You cannot take a general-purpose pointer to an
atomic object, because the client code would not know that it should
take care upon dereferencing.

I am not sure what that means. I guess the point you are making is that
there are levels of classification which don't affect the data type but
they do affect how it can be accessed - with the language needing to
prevent a reference weakening the storage model. For example, a
read-write reference to a substring should be prevented from being used
to access part of a string which is supposed to be read-only.

Yes, it is a type constraint. There are all sorts of constraints one
could put on a type in order to produce a constrained subtype.
Constraining limits operations, e.g. immutability removes mutators. It
also directs certain implementations like using locking instructions or dropping known bounds.

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stefan Ram@21:1/5 to James Harris on Thu Oct 27 08:58:50 2022

James Harris <james.harris.1@gmail.com> writes:

Potential operations on string structures:
* allocate a new string
* create a slice (view) of an existing string
* index into a string

Many of such operations are provided by the standard library
of C++. You could have a look at its implementation. One might
even think of kinda "backporting" it to C. Or use C++.

Suggested Video: "The strange details of std::string at
Facebook" - Nicholas Ormrod (2016)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Bart@21:1/5 to James Harris on Thu Oct 27 12:14:28 2022

On 26/10/2022 21:33, James Harris wrote:

On 24/10/2022 15:28, Bart wrote:

On 24/10/2022 11:31, James Harris wrote:

Do you guys have any thoughts on the best ways for strings of
characters to be stored?

..

For lower level strings, I'd highly recommend using zero-terminated
strings, or using them as the basis, or at least having it as an option.

They certainly seem easiest to work with although they do have
limitations such as:

* cannot include a character with the encoding of zero (as you say)
* must be scanned to determine length

I use such strings in my 1Mlps compilers. Note that in that application:

* You don't need to get string lengths that often
* The vast majority of strings are short
* String append (that you mention below) is mainly when speed is not
critical (eg. for diagnostics)

* awkward to add to or delete from the end of as they don't carry any
data about whether the memory immediately following is available or not

The next step up, in lower level code, is to use a slice. This is a
(pointer, length) descriptor. Here no terminator is necessary, and
allows strings to also contain embedded zeros (so can contain any
binary data).

String slices can point into another string (allowing sharing), or
into another slice, or into a regular zero-terminated string.

That's more universal and therefore perhaps the best to implement if
only one scheme is to be available.

Most strings are fixed-length once created; strings that can grow are
rare. You don't need a 'capacity' field for example (like C++'s Vector
type).

But managing memory can still be an issue because you don't know if a particular slice owns its memory, or points to a string literal, or
points into a shared string, or points to external memory.

So a simple slice suits a lower-level language where you do this manual
(it would be a welcome addition to C for example).

My main language is just like this.

Have to say, though, I guess it

would be hard to manage the memory for. Instead of just (first, length)
or (first, past) perhaps one would need something like

struct
    first: pointer to first element
    past: pointer just past last element
    count: number of slices pointing to this slice/string
    base: the parent string or memory
    flags: various
end struct

The base field would refer to the string object we were a slice of or,
if we were not a slice but the base string, the memory area in which the string was stored.

The flags would indicate whether the string/slice could have its
contents changed and whether it could have its length changed, whether
the contents could be moved in memory, etc.

However to call an API function expecting a zero-terminated string
('stringz` as I sometimes call it), the pointer is not enough: you
need to ensure there's a zero following those <length> characters!

Within my dynamic scripting language, I have a full-on counted string
type, with reference counting to manage sharing and allow automatic
memory management.

What fields did you use to manage such stuff? Am I on the right lines
with the ideas above?

The structure I use is not lightweight because it is for interpreted
code. The following object descriptor is a 32-byte record, used for all objects. I've shown only the fields used by string objects:

record objrec =
u32 refcount
byte mutable # 1 for mutable strings
byte objtype
u16 dummy

ichar strptr # (ref char)
u64 length
union
u64 alloc64
object objptr2 # (ref objptr)
end
end

The string data itself is separate, pointed to by 'strptr'. This is nil
when the length is zero (it doesn't point to ""). It is not
zero-terminated (unless an external slice happens to be).

Most strings are mutable, then .alloc64 gives the capacity of the
allocation.

An important field is objtype; its values are:

Normal Regular string (uses alloc64)
Slice Slice into another (uses objptr2)
Extslice Strings lie outside the object scheme

For slices, while .strptr refers to the string data in question,
.objptr2 refs to the owner object of that string, which has its own
refcount.

External strings are those that belong to external code (eg. from an FFI function), or those occuring inside a packed struct field for example.

So .objtype is used when sharing or freeing string data.

As I said, this is for interpreted code which can afford to do this
fiddly checking at runtime, which is not done inline either.

For static languages using inline code, it might need to be more
streamlined.

Note that if you take those 32 bytes, then the middle 16 bytes (.strptr
and .length fields) correspond to a raw Slice as used in my lower level language.

But with the same headache when calling low-level FFI functions that
expect C-like strings.

Just a thought: ensure there is always at least one more byte of memory
than the string requires and put a zero byte at the end of the string
before calling any function which expects a C-like string. (User responsibility to ensure there are no zero bytes embedded in the string.)

I think I tried that once. In general it doesn't work, as you might have
a slice into another string; you can't inject a zero byte into the
middle of that other string!

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stefan Ram@21:1/5 to Stefan Ram on Thu Oct 27 10:19:34 2022

ram@zedat.fu-berlin.de (Stefan Ram) writes:

Many of such operations are provided by the standard library
of C++. You could have a look at its implementation. One might
even think of kinda "backporting" it to C. Or use C++.

One could also look at the implementation of strings in
Python. Python already is a library that can be used
from C. So, one could use Python in C as a library just
for its data types or just for string handling.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From James Harris@21:1/5 to Dmitry A. Kazakov on Sat Oct 29 12:23:10 2022

On 27/10/2022 08:28, Dmitry A. Kazakov wrote:

On 2022-10-26 21:43, James Harris wrote:

On 24/10/2022 13:07, Dmitry A. Kazakov wrote:

On 2022-10-24 12:31, James Harris wrote:

Do you guys have any thoughts on the best ways for strings of
characters to be stored?

1. There's the C way, of course, of reserving one value (zero) and
using it as a terminator.

2. There's the 'length prefix' option of putting the length of the
string in a machine word before the characters.

3. There's the 'double pointer' way of pointing at, say, first and
past (where 'past' means first plus length such that the second
pointer points one position beyond the last character).

Any others?

4. String body only. The constraints are known outside.

This is the way string slices and fixed length strings are
implemented. In the later case the compiler knows the strings bounds
(first and last indices and thus the length). In the former case the
compiler passes a "string dope" along with the naked body. The dope
contains the bounds.

That doesn't seem meaningfully different from case 3. To be clear,
case 3 would be represented by, in addition to the bytes of the string,

struct
   first: pointer to first byte of string
   past: pointer to byte after last byte of string
   .... other fields ....
end struct

The string length would be past - first. The bytes of the string would
be those pointed at (which I presume is what you are calling the naked
body).

That is the structure of a string dope, not the string itself, unless
you have the body in other fields, but then why would you need pointers?

Curious use of terms. I presume that by "dope" you mean a dope vector
which can also be called a control block or a descriptor.

As for this specific case, the same information can be conveyed in
different ways: (start, length), (start, memsize), (first, last),
(first, past). I chose the latter as it should be slightly faster than
the others and does not run into problems when the elements are other
than single bytes.

To explain, for common operations,

memsize() = past - first
length() = memsize() >> alignbits
forward iteration proceeds while address < past
backward iteration proceeds while address >= first

The only stipulation is that the body must not be allocated at the very
top or bottom of the addressable range.

Using (first, past) should be as simple as that. By contrast, the
similar (first, last) runs into a slight problem when elements are wider
than single bytes: should the last pointer point to the start or the end
of the last item?

The others, which involve memsize or length, make it slightly slower to
judge the limits of iteration in the general case, requiring a
calculation to see if a pointer is outside the limits of the string
being referred to.

To clarify terms. String representation must include the string body if
we are talking about values of strings. The things like pointers and vectorized dopes are references to a string, not strings. You can pass a string by a reference, sure. But the string value is somewhere else.
What you pass is not a string it is a substitute.

That depends, surely, on how "a string" is defined. If strings are
defined as descriptors starting with the fields first and past then the
bodies of such strings can be elsewhere. (There would be other fields of
a string descriptor to assist with memory management and probably some
flags, though I am open to suggestions as to what those fields should be.)

This has an effect on pointers. E.g. if you want slices and efficient
raw strings you must distinguish pointers to definite (constrained)
vs. indefinite (unconstrained) objects of same type.

E.g. in Ada you cannot take an indefinite string pointer to a fixed
length string because there is no bounds. If you wanted that feature
you would use a "fat pointer" to carry bounds with it.

Any reason you'd recommend against storing bounds as in the struct,
above?

Start with interoperability of strings and slices of. The crucial requirements would be:

   A slice can be passed to a subprogram expecting a string without copying.

Indeed, that's a major benefit of slices, IMO, being able to pass
something which looks and acts like a string but which doesn't need the elements of the string to be copied.

That said, a slice would probably have a length which the callee can
determine but which the callee cannot change. I presume that's what
you'd call a constraint.

If a callee wanted to be able to change the length of a string then it
would have to be passed a real string, not a slice.

I guess there would be these kinds of string argument:

1. Read-write string. Anything could be done to the string by the
callee. (Would have to be a real string.)

2. Read-write fixed-length string. The string's contents could be
altered but it could not be made longer or shorter. (Could be a real
string or a slice.)

3. Read-only string. Neither its length nor it contents could be altered
by the callee. (Could be a real string or a slice.)

Consider efficiency and low-level close to hardware stuff:

   Aggregation of strings with known bounds does not require storing them.

E.g. you can have arrays of fixed length strings (like an image buffer).
If a member of a structure is a fixed length string, no bounds are
stored. A pointer to a fixed length string is a plain pointer etc.

You mean the string bounds could be known at compile time, say, rather
than at run time. Good point. Any suggestions on how that should be implemented?

This is similar to atomic, volatile objects and pointers to. The
mechanics is same. You cannot take a general-purpose pointer to an
atomic object, because the client code would not know that it should
take care upon dereferencing.

I am not sure what that means. I guess the point you are making is
that there are levels of classification which don't affect the data
type but they do affect how it can be accessed - with the language
needing to prevent a reference weakening the storage model. For
example, a read-write reference to a substring should be prevented
from being used to access part of a string which is supposed to be
read-only.

Yes, it is a type constraint. There are all sorts of constraints one
could put on a type in order to produce a constrained subtype.
Constraining limits operations, e.g. immutability removes mutators. It
also directs certain implementations like using locking instructions or dropping known bounds.

Was with you all the way until you mentioned dropping known bounds. What
does that mean? How can it be legitimate to drop any bounds?

--
James Harris

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Bart@21:1/5 to James Harris on Sat Oct 29 13:24:45 2022

On 29/10/2022 12:23, James Harris wrote:

On 27/10/2022 08:28, Dmitry A. Kazakov wrote:

On 2022-10-26 21:43, James Harris wrote:

On 24/10/2022 13:07, Dmitry A. Kazakov wrote:

On 2022-10-24 12:31, James Harris wrote:

Do you guys have any thoughts on the best ways for strings of
characters to be stored?

1. There's the C way, of course, of reserving one value (zero) and
using it as a terminator.

2. There's the 'length prefix' option of putting the length of the
string in a machine word before the characters.

3. There's the 'double pointer' way of pointing at, say, first and
past (where 'past' means first plus length such that the second
pointer points one position beyond the last character).

Any others?

4. String body only. The constraints are known outside.

This is the way string slices and fixed length strings are
implemented. In the later case the compiler knows the strings bounds
(first and last indices and thus the length). In the former case the
compiler passes a "string dope" along with the naked body. The dope
contains the bounds.

That doesn't seem meaningfully different from case 3. To be clear,
case 3 would be represented by, in addition to the bytes of the string,

struct
   first: pointer to first byte of string
   past: pointer to byte after last byte of string
   .... other fields ....
end struct

The string length would be past - first. The bytes of the string
would be those pointed at (which I presume is what you are calling
the naked body).

That is the structure of a string dope, not the string itself, unless
you have the body in other fields, but then why would you need pointers?

Curious use of terms. I presume that by "dope" you mean a dope vector
which can also be called a control block or a descriptor.

As for this specific case, the same information can be conveyed in
different ways: (start, length), (start, memsize), (first, last),
(first, past). I chose the latter as it should be slightly faster than
the others and does not run into problems when the elements are other
than single bytes.

To explain, for common operations,

memsize() = past - first
length() = memsize() >> alignbits
forward iteration proceeds while address < past
backward iteration proceeds while address >= first

The only stipulation is that the body must not be allocated at the very
top or bottom of the addressable range.

Using (first, past) should be as simple as that. By contrast, the
similar (first, last) runs into a slight problem when elements are wider
than single bytes: should the last pointer point to the start or the end
of the last item?

The others, which involve memsize or length, make it slightly slower to
judge the limits of iteration in the general case, requiring a
calculation to see if a pointer is outside the limits of the string
being referred to.

To clarify terms. String representation must include the string body
if we are talking about values of strings. The things like pointers
and vectorized dopes are references to a string, not strings. You can
pass a string by a reference, sure. But the string value is somewhere
else. What you pass is not a string it is a substitute.

That depends, surely, on how "a string" is defined. If strings are
defined as descriptors starting with the fields first and past then the bodies of such strings can be elsewhere. (There would be other fields of
a string descriptor to assist with memory management and probably some
flags, though I am open to suggestions as to what those fields should be.)

This has an effect on pointers. E.g. if you want slices and
efficient raw strings you must distinguish pointers to definite
(constrained) vs. indefinite (unconstrained) objects of same type.

E.g. in Ada you cannot take an indefinite string pointer to a fixed
length string because there is no bounds. If you wanted that feature
you would use a "fat pointer" to carry bounds with it.

Any reason you'd recommend against storing bounds as in the struct,
above?

Start with interoperability of strings and slices of. The crucial
requirements would be:

    A slice can be passed to a subprogram expecting a string without
copying.

Indeed, that's a major benefit of slices, IMO, being able to pass
something which looks and acts like a string but which doesn't need the elements of the string to be copied.

That said, a slice would probably have a length which the callee can determine but which the callee cannot change. I presume that's what
you'd call a constraint.

If a callee wanted to be able to change the length of a string then it
would have to be passed a real string, not a slice.

I guess there would be these kinds of string argument:

1. Read-write string. Anything could be done to the string by the
callee. (Would have to be a real string.)

2. Read-write fixed-length string. The string's contents could be
altered but it could not be made longer or shorter. (Could be a real
string or a slice.)

3. Read-only string. Neither its length nor it contents could be altered
by the callee. (Could be a real string or a slice.)

4. Extensible string. This is not quite the same as your (1) which
requires only a mutable string.

You can mutate a string (alter individual characters) without needing to
know the overall length or its allocated capacity.

(You might further split that into mutable/non-mutable extensible
strings. Usually if growing a string by appending to it, you don't want
to also alter existing parts of the string.)

(You probably need to consider Unicode strings too, especially if
represented as UTF8, as the meaning of 'length' needs pinning down.)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From James Harris@21:1/5 to Stefan Ram on Sat Oct 29 14:10:49 2022

On 27/10/2022 09:58, Stefan Ram wrote:

James Harris <james.harris.1@gmail.com> writes:

Potential operations on string structures:
* allocate a new string
* create a slice (view) of an existing string
* index into a string

Many of such operations are provided by the standard library
of C++. You could have a look at its implementation. One might
even think of kinda "backporting" it to C. Or use C++.

Suggested Video: "The strange details of std::string at
Facebook" - Nicholas Ormrod (2016)

Thanks. I had a look at that video - and a number of others.

--
James Harris

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From James Harris@21:1/5 to Bart on Sat Oct 29 15:01:34 2022

On 27/10/2022 12:14, Bart wrote:

On 26/10/2022 21:33, James Harris wrote:

On 24/10/2022 15:28, Bart wrote:

On 24/10/2022 11:31, James Harris wrote:

Do you guys have any thoughts on the best ways for strings of
characters to be stored?

..

String slices can point into another string (allowing sharing), or
into another slice, or into a regular zero-terminated string.

That's more universal and therefore perhaps the best to implement if
only one scheme is to be available.

Most strings are fixed-length once created; strings that can grow are
rare. You don't need a 'capacity' field for example (like C++'s Vector
type).

Having watched some videos on string storage recently I now think I know
what you mean by the capacity field - basically that a string descriptor
would consist of these fields:

start
length
capacity

so that the string could be extended at the end (up to the capacity).
That may be a bit restrictive. A programmer might want to remove or add characters at the beginning rather than just at the end, even though
such would be done less often.

So what do you think of having a string descriptor more like

first
past
memfirst
mempast

where memfirst and mempast would define the allocated space in which the
string body would sit.

Or perhaps the descriptor should be unified with other references to
memory - as I understand is true of your example, below.

But managing memory can still be an issue because you don't know if a particular slice owns its memory, or points to a string literal, or
points into a shared string, or points to external memory.

Yes, some flags would be needed. They could be stored in the low-order
bits of memfirst and mempast given that:

a) allocations (hence, memfirst) could be aligned
b) sizes of allocations (hence, mempast) could be rounded up to a
suitable power of 2
c) memfirst and mempast would need to be used far less often than first
and past so there would be no great problem with the cost of masking out
the low-order bits to get the addresses.

..

Within my dynamic scripting language, I have a full-on counted string
type, with reference counting to manage sharing and allow automatic
memory management.

What fields did you use to manage such stuff? Am I on the right lines
with the ideas above?

The structure I use is not lightweight because it is for interpreted
code. The following object descriptor is a 32-byte record, used for all objects. I've shown only the fields used by string objects:

    record objrec =
        u32         refcount
        byte        mutable      # 1 for mutable strings
        byte        objtype
        u16         dummy

        ichar       strptr       # (ref char)
        u64         length
        union
            u64     alloc64
            object objptr2      # (ref objptr)
        end
    end

That looks very sensible. I have considered having 'sentient references'
which would have a common format for anything which refers to memory (especially referents of dynamic size) and would include the address of
a vtable for the specific type of sentient reference. The vtable would
hold the addresses of methods which could be applied to the reference
rather than to the referent. IOW the referent and the reference would
each have a type.

ATM I think I'd need to work through a lot more use cases before I would
be ready to settle on the details of that so for now I may just go with
the idea of a string descriptor.

The string data itself is separate, pointed to by 'strptr'. This is nil
when the length is zero (it doesn't point to ""). It is not
zero-terminated (unless an external slice happens to be).

Most strings are mutable, then .alloc64 gives the capacity of the
allocation.

An important field is objtype; its values are:

    Normal            Regular string (uses alloc64)
    Slice             Slice into another (uses objptr2)
    Extslice          Strings lie outside the object scheme

OK. I may use something like that or, possibly, some flags.

..

Note that if you take those 32 bytes, then the middle 16 bytes (.strptr
and .length fields) correspond to a raw Slice as used in my lower level language.

Good point. I'd need slices to have the same format as strings and for
both to have flags. As there's no space for flags in the (first, past)
pair I'd need to add a flags word, making the structure

first
past
misc
memfirst
mempast

where misc would store various pieces of information, not just flag
bits. Slices would have only the first three fields. Strings would have
all five. Flags would indicate whether this was a string or a slice.

For me it's too early to optimise but it's worth noting that even for
64-bit machines the above would occupy only 24 or 40 bytes of a 64-byte
cache line so short string bodies could be stored in the same line,
again with flags indicating that that was so.

--
James Harris

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From James Harris@21:1/5 to Bart on Sat Oct 29 15:16:27 2022

On 29/10/2022 13:24, Bart wrote:

On 29/10/2022 12:23, James Harris wrote:

I guess there would be these kinds of string argument:

1. Read-write string. Anything could be done to the string by the
callee. (Would have to be a real string.)

2. Read-write fixed-length string. The string's contents could be
altered but it could not be made longer or shorter. (Could be a real
string or a slice.)

3. Read-only string. Neither its length nor it contents could be
altered by the callee. (Could be a real string or a slice.)

4. Extensible string. This is not quite the same as your (1) which
requires only a mutable string.

You mean a string which can be made longer but the existing contents
could not be changed? I cannot think of a use case for that.

You can mutate a string (alter individual characters) without needing to
know the overall length or its allocated capacity.

Wouldn't you need to know how long the string was so that a callee could
make sure it was trying to modify characters within the string rather
than memory locations outside it?

(You might further split that into mutable/non-mutable extensible
strings. Usually if growing a string by appending to it, you don't want
to also alter existing parts of the string.)

Mutable and extensible are good descriptions though as above I don't yet
see the value in allowing a string to be extensible but its existing
contents to be immutable.

A slice would be inextensible but could be mutable or immutable, AISI.

(You probably need to consider Unicode strings too, especially if
represented as UTF8, as the meaning of 'length' needs pinning down.)

I haven't mentioned it but ATM my chars are 32-bit and any 32-bit value
can be stored in them, including zero. It also means there's no way to
reserve a value for EOF so that condition has to be handled a different
way from what C programmers are used to where EOF is a value which is
outside the range permitted for chars. Challenges a plenty!

--
James Harris

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Bart@21:1/5 to James Harris on Sat Oct 29 16:30:13 2022

On 29/10/2022 15:16, James Harris wrote:

On 29/10/2022 13:24, Bart wrote:

On 29/10/2022 12:23, James Harris wrote:

I guess there would be these kinds of string argument:

1. Read-write string. Anything could be done to the string by the
callee. (Would have to be a real string.)

2. Read-write fixed-length string. The string's contents could be
altered but it could not be made longer or shorter. (Could be a real
string or a slice.)

3. Read-only string. Neither its length nor it contents could be
altered by the callee. (Could be a real string or a slice.)

4. Extensible string. This is not quite the same as your (1) which
requires only a mutable string.

You mean a string which can be made longer but the existing contents
could not be changed? I cannot think of a use case for that.

That's a pattern I used all the time to incrementally build strings, for example to generate C or ASM source files from a language app.

Or it can be as simple as this:

errormess +:= " on line "+tostr(linenumber)

Once extended, the existing parts of the string are never modified.

Perhaps you can give an example of where mutating the characters of a
string, extensible or otherwise, comes in useful.

(My strings generally are mutable, but it's not a feature I use a great
deal.

For applications like text editors, I use a list of strings, one per
line. And editing within each line create a new string for each edit. Efficiency here is not critical, and the needs are diverse, like
deleting within the string, or insertion. It's just easier to construct
a new one.)

You can mutate a string (alter individual characters) without needing
to know the overall length or its allocated capacity.

Wouldn't you need to know how long the string was so that a callee could
make sure it was trying to modify characters within the string rather
than memory locations outside it?

My point is that, given only the string pointer and an index or offset
into it, that's all that's needed to modify it. If slices at least are
used, then the callee could do bounds checking /if it wanted/.

(My dynamic language does do runtime checking of bounds but, once an application has been developed, it is very, very rare that I have a
bounds error come up. In a working, debugged program, it should not be necessary.)

(You might further split that into mutable/non-mutable extensible
strings. Usually if growing a string by appending to it, you don't
want to also alter existing parts of the string.)

Mutable and extensible are good descriptions though as above I don't yet
see the value in allowing a string to be extensible but its existing
contents to be immutable.

A slice would be inextensible but could be mutable or immutable, AISI.

(You probably need to consider Unicode strings too, especially if
represented as UTF8, as the meaning of 'length' needs pinning down.)

I haven't mentioned it but ATM my chars are 32-bit and any 32-bit value
can be stored in them, including zero. It also means there's no way to reserve a value for EOF so that condition has to be handled a different
way from what C programmers are used to where EOF is a value which is
outside the range permitted for chars. Challenges a plenty!

But you're not using all 2**32 bit patterns? It could reserve -1 or all
1s for EOF just like C does. Because EOF would generally be used for character-at-a-time streaming, which is typically 8-bit anyway.

Or have you developed a binary file system which works with 32-bit-wide 'bytes'?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From James Harris@21:1/5 to Bart on Sat Oct 29 18:42:51 2022

On 29/10/2022 16:30, Bart wrote:

On 29/10/2022 15:16, James Harris wrote:

On 29/10/2022 13:24, Bart wrote:

On 29/10/2022 12:23, James Harris wrote:

I guess there would be these kinds of string argument:

1. Read-write string. Anything could be done to the string by the
callee. (Would have to be a real string.)

2. Read-write fixed-length string. The string's contents could be
altered but it could not be made longer or shorter. (Could be a real
string or a slice.)

3. Read-only string. Neither its length nor it contents could be
altered by the callee. (Could be a real string or a slice.)

4. Extensible string. This is not quite the same as your (1) which
requires only a mutable string.

You mean a string which can be made longer but the existing contents
could not be changed? I cannot think of a use case for that.

That's a pattern I used all the time to incrementally build strings, for example to generate C or ASM source files from a language app.

Or it can be as simple as this:

errormess +:= " on line "+tostr(linenumber)

Once extended, the existing parts of the string are never modified.

Good examples. The 'extend' permission seems a bit specific although I
accept that the uses you mention are common. I suppose it adds to the
security of the language to be able to designate a string as extensible/inextensible separately from designating whether its existing contents can be changed or not.

How would it be used? Thinking about functions which take a string as
input, most strings would be purely inputs. They would therefore be both read-only and inextensible within the called function. Such arguments
could be strings or slices.

Further, functions which /return/ a string would create the string and
return it whole.

It is only functions which /modify/ a string, i.e. take it as an inout parameter, where it would matter whether the string was read/write or extensible. For an inout string what should be the defaults? If we say
an inout string defaults to immutable and inextensible then that would
lead to the following ways to specify a string, s, as a parameter:

f: function(s: inout string char)
f: function(s: inout string char rw)
f: function(s: inout string char ext rw)
f: function(s: inout string char ext)

Note the "ext" and "rw" attributes. The idea is that they would specify
how the string could be modified in the function. Adding rw would allow
the string's existing contents to be taken as read-write rather than
read-only. Adding ext would allow the string to be extended.

That's effectively me thinking out loud and trying out some ideas. How
does it look to you?

What about other permissions such as prepend, split, insert, delete,
etc? Perhaps it's too specific to have too many qualifiers although I
can see value in using such info to help match caller and callee. For
example, given the above one could say that as long as the callee
doesn't specify the string as ext then it could be either a string or a
slice. That is appealing from a security perspective.

That said, can a compiler ensure that a string is not used in a way
which breaks the contract indicated by its keywords? You raise some big
issues!

Perhaps you can give an example of where mutating the characters of a
string, extensible or otherwise, comes in useful.

I intend a string to be simply an array whose length can be changed. The
idea being that a program could have a string of integers, a string of
floats etc just as easily as having a string of characters. As such,
anything which changes the content of an array should also work on
strings. For example, one might want to sort an array in place. As a
string of characters one might want to convert lower case to upper case,
etc.

(My strings generally are mutable, but it's not a feature I use a great
deal.

For applications like text editors, I use a list of strings, one per
line. And editing within each line create a new string for each edit. Efficiency here is not critical, and the needs are diverse, like
deleting within the string, or insertion. It's just easier to construct
a new one.)

OK.

..

(You might further split that into mutable/non-mutable extensible
strings. Usually if growing a string by appending to it, you don't
want to also alter existing parts of the string.)

Mutable and extensible are good descriptions though as above I don't
yet see the value in allowing a string to be extensible but its
existing contents to be immutable.

A slice would be inextensible but could be mutable or immutable, AISI.

(You probably need to consider Unicode strings too, especially if
represented as UTF8, as the meaning of 'length' needs pinning down.)

I haven't mentioned it but ATM my chars are 32-bit and any 32-bit
value can be stored in them, including zero. It also means there's no
way to reserve a value for EOF so that condition has to be handled a
different way from what C programmers are used to where EOF is a value
which is outside the range permitted for chars. Challenges a plenty!

But you're not using all 2**32 bit patterns? It could reserve -1 or all
1s for EOF just like C does. Because EOF would generally be used for character-at-a-time streaming, which is typically 8-bit anyway.

As above, the language is meant to treat strings as arrays. So AISI it
should not ascribe any particular meaning to their contents.

There are other ways. For example, my plan for EOF is twofold:

1. to have it as an attribute of a file object

2. to have an attempt to read at EOF throw a weak exception which would
be a catchable way to end an iteration.

Or have you developed a binary file system which works with 32-bit-wide 'bytes'?

No, my system is nothing like that advanced. At present all bytes
(octets) I read from disk are zero extended to 32 bits. And all chars I
write to disk have their top 24 zero bits chopped off. Though please
don't think that's by design. It's only a temporary measure while I get
the compiler up and running properly. (The compiler and the compilable
language are, at present, rather limited.) In the long term IO streams
should be via typed channels where chars of octets (or some other size)
could be handled natively.

--
James Harris

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Dmitry A. Kazakov@21:1/5 to James Harris on Sat Oct 29 22:17:27 2022

On 2022-10-29 13:23, James Harris wrote:

On 27/10/2022 08:28, Dmitry A. Kazakov wrote:

On 2022-10-26 21:43, James Harris wrote:

On 24/10/2022 13:07, Dmitry A. Kazakov wrote:

On 2022-10-24 12:31, James Harris wrote:

Do you guys have any thoughts on the best ways for strings of
characters to be stored?

1. There's the C way, of course, of reserving one value (zero) and
using it as a terminator.

2. There's the 'length prefix' option of putting the length of the
string in a machine word before the characters.

3. There's the 'double pointer' way of pointing at, say, first and
past (where 'past' means first plus length such that the second
pointer points one position beyond the last character).

Any others?

4. String body only. The constraints are known outside.

This is the way string slices and fixed length strings are
implemented. In the later case the compiler knows the strings bounds
(first and last indices and thus the length). In the former case the
compiler passes a "string dope" along with the naked body. The dope
contains the bounds.

That doesn't seem meaningfully different from case 3. To be clear,
case 3 would be represented by, in addition to the bytes of the string,

struct
   first: pointer to first byte of string
   past: pointer to byte after last byte of string
   .... other fields ....
end struct

The string length would be past - first. The bytes of the string
would be those pointed at (which I presume is what you are calling
the naked body).

That is the structure of a string dope, not the string itself, unless
you have the body in other fields, but then why would you need pointers?

Curious use of terms. I presume that by "dope" you mean a dope vector
which can also be called a control block or a descriptor.

As for this specific case, the same information can be conveyed in
different ways: (start, length), (start, memsize), (first, last),
(first, past). I chose the latter as it should be slightly faster than
the others and does not run into problems when the elements are other
than single bytes.

Yes. The problem with (first, next) is that next could be inexpressible.
Most difficulties arise with strings/arrays over enumerations and
modular types. (first, last) has no such problem.

Both have issues with empty strings, e.g. with a multitude of
representations of. Compare with +/-0 problem for non-2-complement integers.

Using (first, past) should be as simple as that. By contrast, the
similar (first, last) runs into a slight problem when elements are wider
than single bytes: should the last pointer point to the start or the end
of the last item?

Just do not use multibyte representations at all. E.g. UTF-8 string is represented by an array of *octets*. It has a view of an array of code
points, but that is not the physical representation, only a view.

To clarify terms. String representation must include the string body
if we are talking about values of strings. The things like pointers
and vectorized dopes are references to a string, not strings. You can
pass a string by a reference, sure. But the string value is somewhere
else. What you pass is not a string it is a substitute.

That depends, surely, on how "a string" is defined.

def String is a sequence of characters.

There is not other definitions. Little depends on that because there is
no requirement to represent string this way. You are completely free to
choose any suitable representation.

[...]

Skipped description of a possible representation.

This has an effect on pointers. E.g. if you want slices and
efficient raw strings you must distinguish pointers to definite
(constrained) vs. indefinite (unconstrained) objects of same type.

E.g. in Ada you cannot take an indefinite string pointer to a fixed
length string because there is no bounds. If you wanted that feature
you would use a "fat pointer" to carry bounds with it.

Any reason you'd recommend against storing bounds as in the struct,
above?

Start with interoperability of strings and slices of. The crucial
requirements would be:

    A slice can be passed to a subprogram expecting a string without
copying.

Indeed, that's a major benefit of slices, IMO, being able to pass
something which looks and acts like a string but which doesn't need the elements of the string to be copied.

That said, a slice would probably have a length which the callee can determine but which the callee cannot change. I presume that's what
you'd call a constraint.

It could be a constraint for fixed length slices.

If a callee wanted to be able to change the length of a string then it
would have to be passed a real string, not a slice.

A callee might pass a variable length slice, which, for example, can be enlarged or shortened. Many languages with dynamically allocated strings
have this. You need to find some balance between flexibility of
pool-allocated strings and efficiency of fixed length ones. If the
language has a developed type system you can have both transparently interchangeable for the programmer. Note this is same discussion as with numbers. Programmers want all of them with an ability to pass one for
another.

I guess there would be these kinds of string argument:

1. Read-write string. Anything could be done to the string by the
callee. (Would have to be a real string.)

2. Read-write fixed-length string. The string's contents could be
altered but it could not be made longer or shorter. (Could be a real
string or a slice.)

3. Read-only string. Neither its length nor it contents could be altered
by the callee. (Could be a real string or a slice.)

Think of it in terms of constraints. Immutability is a constraint. Fixed
length is a constraint. Bounded length is a constraint. Non-sliding
lower bound is a constraint. Non-sliding upper bound is a constraint.

This should cover all spectrum. You can express all cases in terms of constraints.

Consider efficiency and low-level close to hardware stuff:

    Aggregation of strings with known bounds does not require storing
them.

E.g. you can have arrays of fixed length strings (like an image
buffer). If a member of a structure is a fixed length string, no
bounds are stored. A pointer to a fixed length string is a plain
pointer etc.

You mean the string bounds could be known at compile time, say, rather
than at run time. Good point. Any suggestions on how that should be implemented?

As I said, you just have the string body and nothing else in the representation. Compare it to numbers. You can have indefinite length
integers, but for many reasons programmers stick to constrained variants
like -2**15..2*15-1.

This is similar to atomic, volatile objects and pointers to. The
mechanics is same. You cannot take a general-purpose pointer to an
atomic object, because the client code would not know that it should
take care upon dereferencing.

I am not sure what that means. I guess the point you are making is
that there are levels of classification which don't affect the data
type but they do affect how it can be accessed - with the language
needing to prevent a reference weakening the storage model. For
example, a read-write reference to a substring should be prevented
from being used to access part of a string which is supposed to be
read-only.

Yes, it is a type constraint. There are all sorts of constraints one
could put on a type in order to produce a constrained subtype.
Constraining limits operations, e.g. immutability removes mutators. It
also directs certain implementations like using locking instructions
or dropping known bounds.

Was with you all the way until you mentioned dropping known bounds. What
does that mean? How can it be legitimate to drop any bounds?

If you can deduce bounds why would you keep them? Again, consider 16-bit integer. Do you keep -2**15 and 2*15-1 anywhere? You do not. The same
should happen to fixed length or fixed bounds strings.

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Bart@21:1/5 to James Harris on Sat Oct 29 22:02:48 2022

On 29/10/2022 18:42, James Harris wrote:

On 29/10/2022 16:30, Bart wrote:

Further, functions which /return/ a string would create the string and
return it whole.

Not necessarily. My dynamic language can return a string which is a
slice into another. (Slices are not exposed in this language; they are
in the static one, where slices are distinct types.)

Example:

func trim(s) =
if s.len=2 then return "" fi
return s[2..$-1]
end

This trims the first and last character of string. But here it returns a
slice into the original string. If I wanted a fresh copy, I'd have to
use copy() inside the function, or copy() (or a special kind of
assignment) outside it.

It is only functions which /modify/ a string, i.e. take it as an inout parameter, where it would matter whether the string was read/write or extensible. For an inout string what should be the defaults? If we say
an inout string defaults to immutable and inextensible then that would
lead to the following ways to specify a string, s, as a parameter:

f: function(s: inout string char)
f: function(s: inout string char rw)
f: function(s: inout string char ext rw)
f: function(s: inout string char ext)

Note the "ext" and "rw" attributes. The idea is that they would specify
how the string could be modified in the function. Adding rw would allow
the string's existing contents to be taken as read-write rather than read-only. Adding ext would allow the string to be extended.

That's effectively me thinking out loud and trying out some ideas. How
does it look to you?

My preference is to keep it simple:

(1) String parameters are immutable

(2) String parameters are mutable (this is changing existing
content but also allow extension, plus deletion etc - the
works)

(3) String parameter are assignable. This means that, in addition to
(2), assigning to the parameter also replaces the caller's version

Python allows only (2), when working with Lists, and only (1) when
working with Strings (Strings are immutable, Lists are mutable)

(3) Requires full reference parameters so won't work in Python.

My scripting language allows (2) and (3) on both lists and strings. (1)
is only possible by a flag within the object that renders it immutable
(for example, passing a literal "ABC").

When I mentioned having extensibility as a different capability than
mutation, it is because this could be done via different string types.

There is in-place modification which changes the length of the object,
and modification where the length is not changed; I think these could be useful, distinct attributes.

Changing the length requires a reference to the /original/ descriptor
where all the info is stored (heap pointer, length, capacity).

But changing the contents without affecting the size either only needs
the heap pointer, or can be done with a /copy/ of the descriptor; it
doesn't not need the original.

(My first implementation of a string type, on a 16- then 32-bit machine,
used a 16-byte descriptor passed by value. The string was mutable, but
it was not possible to extend it without a proper reference.)

What about other permissions such as prepend, split, insert, delete,
etc?

These all count as in-place modifications (except split), but as I said
above, it might be useful to treat length-modifying ones differently.

It's not clear how 'split' works, but there are anyway all sorts of
string ops that are not 'in-place'; they simply create new strings.
Presumably 'split' creates 2 or more new strings.

I intend a string to be simply an array whose length can be changed.

I treat a string as one composite object normally treated as a single
value (like a record). I treat an array or list a collection of distinct objects. But this is a minor point (it affects hows [] indexing works).

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Bart@21:1/5 to James Harris on Sat Oct 29 22:25:39 2022

On 29/10/2022 15:01, James Harris wrote:

On 27/10/2022 12:14, Bart wrote:

Most strings are fixed-length once created; strings that can grow are
rare. You don't need a 'capacity' field for example (like C++'s Vector
type).

Having watched some videos on string storage recently I now think I know
what you mean by the capacity field - basically that a string descriptor would consist of these fields:

start
length
capacity

so that the string could be extended at the end (up to the capacity).
That may be a bit restrictive. A programmer might want to remove or add characters at the beginning rather than just at the end, even though
such would be done less often.

Doing a prepend is not a problem. What's critical is whether the new
length is still within the current allocation. (Prepend requires
shifting of the old string so is less efficient anyway.)

If a new allocation is needed, you may be copying data for both prepend
and append.

With delete however, you may need to think about whether to /reduce/ the allocation size.

So what do you think of having a string descriptor more like

first
past
memfirst
mempast

where memfirst and mempast would define the allocated space in which the string body would sit.

What's the difference between 'first' and 'memfirst'? Would you have a
string that doesn't start at the beginning of its allocated block?

An important field is objtype; its values are:

     Normal            Regular string (uses alloc64)
     Slice             Slice into another (uses objptr2)
     Extslice          Strings lie outside the object scheme

OK. I may use something like that or, possibly, some flags.

..

Note that if you take those 32 bytes, then the middle 16 bytes
(.strptr and .length fields) correspond to a raw Slice as used in my
lower level language.

Good point. I'd need slices to have the same format as strings and for
both to have flags. As there's no space for flags in the (first, past)
pair I'd need to add a flags word, making the structure

first
past
misc
memfirst
mempast

where misc would store various pieces of information, not just flag
bits. Slices would have only the first three fields. Strings would have
all five. Flags would indicate whether this was a string or a slice.

For me it's too early to optimise but it's worth noting that even for
64-bit machines the above would occupy only 24 or 40 bytes of a 64-byte
cache line so short string bodies could be stored in the same line,
again with flags indicating that that was so.

I think that if your string implementation requires a 24 or 40-byte
descriptor, then thinking about cache-line optimisation /is/ premature!

I considered such a descriptor too heavyweight for my static language.
(I did incorporate such a string type once, intended for uses where
performance didn't matter: sorting out UI, printing error messages and diagnostics, that sort of thing.)

In the end I decided it did't really fit. But then I have two languages.

I guess yours likely sits between my two.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From James Harris@21:1/5 to Dmitry A. Kazakov on Sun Oct 30 11:24:33 2022

On 29/10/2022 21:17, Dmitry A. Kazakov wrote:

On 2022-10-29 13:23, James Harris wrote:

..

As for this specific case, the same information can be conveyed in
different ways: (start, length), (start, memsize), (first, last),
(first, past). I chose the latter as it should be slightly faster than
the others and does not run into problems when the elements are other
than single bytes.

Yes. The problem with (first, next) is that next could be inexpressible.
Most difficulties arise with strings/arrays over enumerations and
modular types. (first, last) has no such problem.

Both have issues with empty strings, e.g. with a multitude of
representations of. Compare with +/-0 problem for non-2-complement
integers.

That sounds interesting. Do you see multiple representations of the
empty string in the following? Monospacing required. Here's how the
string "abcd" would be stored

!_a_!_b_!_c_!_d_!

^ ^
! !
first past

* so first would point at the first element of the string
* and past would point one cell beyond the last element of the string.

I don't see where you see a multitude of representations of the null
string. AISI the empty string would simply have past equal to first in
all cases.

..

That said, a slice would probably have a length which the callee can
determine but which the callee cannot change. I presume that's what
you'd call a constraint.

It could be a constraint for fixed length slices.

If a callee wanted to be able to change the length of a string then it
would have to be passed a real string, not a slice.

A callee might pass a variable length slice, which, for example, can be enlarged or shortened. Many languages with dynamically allocated strings
have this.

What is your definition of a slice? Is it /part/ of an underlying string
or is it a /copy/ of part of a string? For example, if

string S = "abcde"
slice T = S[1..3] ;"bcd"

then changes to T would do what to S?

If slice is a view of an underlying string (which is what I had in mind)
then I don't get how you could meaningfully enlarge or shorten it.

..

I guess there would be these kinds of string argument:

1. Read-write string. Anything could be done to the string by the
callee. (Would have to be a real string.)

2. Read-write fixed-length string. The string's contents could be
altered but it could not be made longer or shorter. (Could be a real
string or a slice.)

3. Read-only string. Neither its length nor it contents could be
altered by the callee. (Could be a real string or a slice.)

Think of it in terms of constraints. Immutability is a constraint. Fixed length is a constraint. Bounded length is a constraint. Non-sliding
lower bound is a constraint. Non-sliding upper bound is a constraint.

This should cover all spectrum. You can express all cases in terms of constraints.

I presume such constraints would be specified when objects are declared.
As a programmer how would you want to specify such constraints? Would
each have a reserved word, for example?

--
James Harris

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Dmitry A. Kazakov@21:1/5 to James Harris on Sun Oct 30 15:20:03 2022

On 2022-10-30 12:24, James Harris wrote:

On 29/10/2022 21:17, Dmitry A. Kazakov wrote:

On 2022-10-29 13:23, James Harris wrote:

..

As for this specific case, the same information can be conveyed in
different ways: (start, length), (start, memsize), (first, last),
(first, past). I chose the latter as it should be slightly faster
than the others and does not run into problems when the elements are
other than single bytes.

Yes. The problem with (first, next) is that next could be
inexpressible. Most difficulties arise with strings/arrays over
enumerations and modular types. (first, last) has no such problem.

Both have issues with empty strings, e.g. with a multitude of
representations of. Compare with +/-0 problem for non-2-complement
integers.

That sounds interesting. Do you see multiple representations of the
empty string in the following? Monospacing required. Here's how the
string "abcd" would be stored

!_a_!_b_!_c_!_d_!

    ^               ^
    !               !
first            past

* so first would point at the first element of the string
* and past would point one cell beyond the last element of the string.

I don't see where you see a multitude of representations of the null
string. AISI the empty string would simply have past equal to first in
all cases.

...
(0..0)
(1..1)
(2..2)
...
(n..n)
...

With pointers it becomes even worse as some of them might point to
invalid addresses.

That said, a slice would probably have a length which the callee can
determine but which the callee cannot change. I presume that's what
you'd call a constraint.

It could be a constraint for fixed length slices.

If a callee wanted to be able to change the length of a string then
it would have to be passed a real string, not a slice.

A callee might pass a variable length slice, which, for example, can
be enlarged or shortened. Many languages with dynamically allocated
strings have this.

What is your definition of a slice? Is it /part/ of an underlying string
or is it a /copy/ of part of a string? For example, if

string S = "abcde"
slice T = S[1..3] ;"bcd"

then changes to T would do what to S?

No idea. It depends. Is slice in your example an independent object?

But considering this:

declare
S : String := "abcde";
begin
S (1..3) := "x"; -- Illegal in Ada

But should it be legal, then the result would be

"xde"

Many implementations make this illegal because it would require either
bounded or dynamically allocated unbounded string.

You can consider make it legal for these, but then you would have
different semantics of slices for different strings. And this would
contradict the design principle of having all strings interchangeable regardless the implementation method.

There are contradictions in requirements you as the language designer
has to resolve this or that way.

If slice is a view of an underlying string (which is what I had in mind)
then I don't get how you could meaningfully enlarge or shorten it.

It is only your limited understanding of view as immutable and fixed
length. E.g. if you view a house in infrared why should not you be able
to open its door? Infrared googles would not limit you. Infrared photo
of a house would! (:-))

I guess there would be these kinds of string argument:

1. Read-write string. Anything could be done to the string by the
callee. (Would have to be a real string.)

2. Read-write fixed-length string. The string's contents could be
altered but it could not be made longer or shorter. (Could be a real
string or a slice.)

3. Read-only string. Neither its length nor it contents could be
altered by the callee. (Could be a real string or a slice.)

Think of it in terms of constraints. Immutability is a constraint.
Fixed length is a constraint. Bounded length is a constraint.
Non-sliding lower bound is a constraint. Non-sliding upper bound is a
constraint.

This should cover all spectrum. You can express all cases in terms of
constraints.

I presume such constraints would be specified when objects are declared.

Objects and/or subtypes. Depending on the language preferences. Note
also that you can have constrained views of the same object. E.g. you
have a mutable variable passed down as in-argument. That would be an
immutable view of the same object.

As a programmer how would you want to specify such constraints? Would
each have a reserved word, for example?

In some cases constraints might be implied. But usually language have
lots of [sub]type modifiers like

in, in out, out, constant
atomic, volatile, shared
aliased (can get pointers to)
external, static
public, private, protected (visibility constraints)
range, length, bounds
parameter AKA discriminant (general purpose constraint)
specific type AKA static/dynamic up/downcast (view as another type)
class-wide (view as a class of types rooted in this one)
...
measurement unit

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Bart@21:1/5 to Dmitry A. Kazakov on Sun Oct 30 14:51:53 2022

On 30/10/2022 14:20, Dmitry A. Kazakov wrote:

On 2022-10-30 12:24, James Harris wrote:

On 29/10/2022 21:17, Dmitry A. Kazakov wrote:

On 2022-10-29 13:23, James Harris wrote:

..

As for this specific case, the same information can be conveyed in
different ways: (start, length), (start, memsize), (first, last),
(first, past). I chose the latter as it should be slightly faster
than the others and does not run into problems when the elements are
other than single bytes.

Yes. The problem with (first, next) is that next could be
inexpressible. Most difficulties arise with strings/arrays over
enumerations and modular types. (first, last) has no such problem.

Both have issues with empty strings, e.g. with a multitude of
representations of. Compare with +/-0 problem for non-2-complement
integers.

That sounds interesting. Do you see multiple representations of the
empty string in the following? Monospacing required. Here's how the
string "abcd" would be stored

   !_a_!_b_!_c_!_d_!

     ^               ^
     !               !
   first            past

* so first would point at the first element of the string
* and past would point one cell beyond the last element of the string.

I don't see where you see a multitude of representations of the null
string. AISI the empty string would simply have past equal to first in
all cases.

   ...
   (0..0)
   (1..1)
   (2..2)
   ...
   (n..n)
   ...

With pointers it becomes even worse as some of them might point to
invalid addresses.

I don't know what these numbers mean. The main problem with 'first' and
'past' is that with an empty string, 'first' doesn't point anywhere, and
'past' ends up pointing to that same place, wherever that is.

I don't like it because that address is meaningless. Except possibly
when refering to an empty slice of an actual string.

   string S = "abcde"
   slice T = S[1..3] ;"bcd"

then changes to T would do what to S?

Let's try it:

s ::= "abcde" # ::= is needed to make s (and t) mutable
t := s[1..3]
t[2]:="?"

println s # a?cde
println t # a?c

The language doesn't allow an empty slice, say s[1..0], although it
ought to be well-behaved (I think it just expects j>=i in s[i..j].)

No idea. It depends. Is slice in your example an independent object?

But considering this:

declare
   S : String := "abcde";
begin
   S (1..3) := "x"; -- Illegal in Ada

But should it be legal, then the result would be

"xde"

Many implementations make this illegal because it would require either bounded or dynamically allocated unbounded string.

The language gets to say how this works. In mine it would have to be
like this:

s ::= "abcde" # ::= creates a mutable copy
s[1..3] := "xyz" # Can only insert string of matching length

s ends up as "xyzde"

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Charles Lindsey@21:1/5 to James Harris on Sun Oct 30 17:01:21 2022

On 29/10/2022 18:42, James Harris wrote:

As above, the language is meant to treat strings as arrays. So AISI it should not ascribe any particular meaning to their contents.

In that case you should have a look at Algol68, where string are arrays of characters. Every array has an associated descriptor containing its bounds etc. But more importantly a REF to an array is best implemented by including the descriptor in the REF (being a strongly typed language, there is no necessity for REFs to assorted other types to have the same length - they are not necessarily just address values). This makes it easy to construct slices (immutable) and REFs to slices (so the slice is mutable), thus providing many of
the features discussed in this thread. However there is no provision for extending (or shortening) an array, other than to create a new space to copy it into; one could provide library routines with smart features to avoid actual copying in some cases, and with a friendly interface which did not expose the messiness inside.

--
Charles H. Lindsey ---------At my New Home, still doing my own thing------
Tel: +44 161 488 1845 Web: https://www.clerew.man.ac.uk Email: chl@clerew.man.ac.uk Snail-mail: Apt 40, SK8 5BF, U.K.
PGP: 2C15F1A9 Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From luserdroog@21:1/5 to James Harris on Sun Oct 30 09:21:45 2022

On Monday, October 24, 2022 at 5:31:14 AM UTC-5, James Harris wrote:

Do you guys have any thoughts on the best ways for strings of characters
to be stored?

1. There's the C way, of course, of reserving one value (zero) and using
it as a terminator.

2. There's the 'length prefix' option of putting the length of the
string in a machine word before the characters.

3. There's the 'double pointer' way of pointing at, say, first and past (where 'past' means first plus length such that the second pointer
points one position beyond the last character).

Any others?

Options 1 and 2 have the advantage that they can be referred to simply
by address. Option 3 needs an additional place in which to store the
(first, past) control block.

Option 1 has the advantage that it's easy for a program to process (by
either pointer or index).

Options 1 and 3 have the advantage that one can refer to the tail of the string (anything past the first character) without creating a copy,
although option 3 would need a new control block to be created. Option 2 would require a new string to be created.

In fact, option 3 has the advantage that it allows any continuous
substring - head, mid, or tail - to be referred to without making a copy
of the required part of the string.

Options 2 and 3 make it fast to find the length. They also allow any
value (i.e. including zero) to be part of the string.

So: Which of those should a compiler support? Should it support more
than one form? If so, should the language allow the programmer to
specify which form to use on any particular string?

If that's not complicated enough, the above essentially considers
strings whose contents could be read-only or read-write but their
lengths don't change. If the lengths can change then there are
additional issues of storage management. Eek! ;)

Recommendations welcome!

I think an exhaustive list of options would be very large if you're not pre-judging and filtering as you're adding options.

4) [List|Array|Tuple|Iterator] of character objects

5) Use 7 bits for data, 8th bit for terminator. Either ASCII7 or UTF-7
can be used to format the data to squeeze it into 7 bits.

6) Use UCS4 codes (24bit) padded out to 32 bits, and then you get a
whole byte for metadata attached to each character.

...

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From James Harris@21:1/5 to Bart on Sun Oct 30 17:33:19 2022

On 29/10/2022 22:25, Bart wrote:

On 29/10/2022 15:01, James Harris wrote:

On 27/10/2022 12:14, Bart wrote:

Most strings are fixed-length once created; strings that can grow are
rare. You don't need a 'capacity' field for example (like C++'s
Vector type).

Having watched some videos on string storage recently I now think I
know what you mean by the capacity field - basically that a string
descriptor would consist of these fields:

   start
   length
   capacity

so that the string could be extended at the end (up to the capacity).
That may be a bit restrictive. A programmer might want to remove or
add characters at the beginning rather than just at the end, even
though such would be done less often.

Doing a prepend is not a problem. What's critical is whether the new
length is still within the current allocation. (Prepend requires
shifting of the old string so is less efficient anyway.)

Well, see below.

If a new allocation is needed, you may be copying data for both prepend
and append.

Yes.

With delete however, you may need to think about whether to /reduce/ the allocation size.

Agreed.

So what do you think of having a string descriptor more like

   first
   past
   memfirst
   mempast

where memfirst and mempast would define the allocated space in which
the string body would sit.

What's the difference between 'first' and 'memfirst'?

memfirst would point to the start of the block in which the string body existed. (first would point at the same address or a later address.)

Would you have a
string that doesn't start at the beginning of its allocated block?

Yes, that would be useful in some cases. For example, if deleting the
first part of a string one wouldn't want to be forced to copy the rest
of it down.

And there are cases where strings may be built by prepending. A classic
example is construction of a network frame. Each layer adds a header
which, naturally, has to go on the front.

--
James Harris

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From James Harris@21:1/5 to Bart on Sun Oct 30 17:52:41 2022

On 29/10/2022 22:02, Bart wrote:

On 29/10/2022 18:42, James Harris wrote:

On 29/10/2022 16:30, Bart wrote:

Further, functions which /return/ a string would create the string and
return it whole.

Not necessarily. My dynamic language can return a string which is a
slice into another. (Slices are not exposed in this language; they are
in the static one, where slices are distinct types.)

Example:

    func trim(s) =
        if s.len=2 then return "" fi
        return s[2..$-1]
    end

This trims the first and last character of string. But here it returns a slice into the original string. If I wanted a fresh copy, I'd have to
use copy() inside the function, or copy() (or a special kind of
assignment) outside it.

That's a challenging example. In a sense it returns either of two
different types: the caller could be handed a string or a slice.

--
James Harris

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From James Harris@21:1/5 to luserdroog on Sun Oct 30 18:13:42 2022

On 30/10/2022 16:21, luserdroog wrote:

On Monday, October 24, 2022 at 5:31:14 AM UTC-5, James Harris wrote:

Do you guys have any thoughts on the best ways for strings of characters
to be stored?

1. There's the C way, of course, of reserving one value (zero) and using
it as a terminator.

2. There's the 'length prefix' option of putting the length of the
string in a machine word before the characters.

3. There's the 'double pointer' way of pointing at, say, first and past
(where 'past' means first plus length such that the second pointer
points one position beyond the last character).

Any others?

..

I think an exhaustive list of options would be very large if you're not pre-judging and filtering as you're adding options.

4) [List|Array|Tuple|Iterator] of character objects

You mean where the characters are stored individually (one per node)?

5) Use 7 bits for data, 8th bit for terminator. Either ASCII7 or UTF-7
can be used to format the data to squeeze it into 7 bits.

Interesting idea. It's certainly one I hadn't thought of.

6) Use UCS4 codes (24bit) padded out to 32 bits, and then you get a
whole byte for metadata attached to each character.

That's definitely thinking outside the box. I can see it working if the
user (the programmer) wanted a string of 24-bit values but it could be
awkward in other cases such as if he wanted a string of 32-bit or 8-bit
values. I don't think I mentioned it but I'd like the programmer to be
able to choose what the elements of the string would be.

--
James Harris

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From James Harris@21:1/5 to Dmitry A. Kazakov on Sun Oct 30 17:46:16 2022

On 30/10/2022 14:20, Dmitry A. Kazakov wrote:

On 2022-10-30 12:24, James Harris wrote:

..

I don't see where you see a multitude of representations of the null
string. AISI the empty string would simply have past equal to first in
all cases.

   ...
   (0..0)
   (1..1)
   (2..2)
   ...
   (n..n)
   ...

With pointers it becomes even worse as some of them might point to
invalid addresses.

In the general case strings would live at arbitrary addresses so no
meaning could be inferred from any address. In all cases

past - first

would define the length of the string. If the length was zero then it
would be an empty string.

That said, if the string could be extended then both past and first
would have to point to allocated memory into which the extension could
take place.

..

What is your definition of a slice? Is it /part/ of an underlying
string or is it a /copy/ of part of a string? For example, if

   string S = "abcde"
   slice T = S[1..3] ;"bcd"

then changes to T would do what to S?

No idea. It depends. Is slice in your example an independent object?

A slice, at the moment, at least, would be a view of part of a string. Extending the earlier example,

!_a_!_b_!_c_!_d_!

^ ^
! !
s_first s_past

The string in the example would be "abcd" and the slice, delimited by
s_first and s_past, would be the "bc" in the middle of the string. Note
that the contents of the slice would not include the element at s_past.

The slice would appear to be a string with constraints. By default its
contents could be updated in place but it could not be made longer or
shorter.

A callee which wanted only to read a string (a common case) or to update
a string in place should not have to care whether it was passed a string
or a slice. For such a case, strings and slices would be interchangeable.

But considering this:

declare
   S : String := "abcde";
begin
   S (1..3) := "x"; -- Illegal in Ada

In Ada would the following be legal?

S (1..3) := "xxx"; --replacement same size as what it is replacing

I'd be happy with that.

But should it be legal, then the result would be

"xde"

Many implementations make this illegal because it would require either bounded or dynamically allocated unbounded string.

You can consider make it legal for these, but then you would have
different semantics of slices for different strings. And this would contradict the design principle of having all strings interchangeable regardless the implementation method.

I don't mind there being differences along the lines of 'constraints'
where a less-constrained object can be passed to a callee which expects
an object with such constraints or imposes more constraints, but not one
which needs fewer constraints.

..

I presume such constraints would be specified when objects are declared.

Objects and/or subtypes. Depending on the language preferences. Note
also that you can have constrained views of the same object. E.g. you
have a mutable variable passed down as in-argument. That would be an immutable view of the same object.

Yes, and an immutable object could not be passed to a callee which
wanted a mutable object.

As a programmer how would you want to specify such constraints? Would
each have a reserved word, for example?

In some cases constraints might be implied. But usually language have
lots of [sub]type modifiers like

   in, in out, out, constant
   atomic, volatile, shared
   aliased (can get pointers to)
   external, static
   public, private, protected (visibility constraints)
   range, length, bounds
   parameter AKA discriminant (general purpose constraint)
   specific type AKA static/dynamic up/downcast (view as another type)
   class-wide (view as a class of types rooted in this one)
   ...
   measurement unit

So you wouldn't have a keyword to indicate a constraint such as
"Non-sliding lower bound" which you mentioned before but IIUC you might
have some qualification of the 'bounds' keyword as in

bounds(^..)

to indicate an unchangeable lower bound (with ^ meaning the start of the string)?

--
James Harris

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Dmitry A. Kazakov@21:1/5 to James Harris on Sun Oct 30 19:28:50 2022

On 2022-10-30 18:46, James Harris wrote:

On 30/10/2022 14:20, Dmitry A. Kazakov wrote:

On 2022-10-30 12:24, James Harris wrote:

..

I don't see where you see a multitude of representations of the null
string. AISI the empty string would simply have past equal to first
in all cases.

    ...
    (0..0)
    (1..1)
    (2..2)
    ...
    (n..n)
    ...

With pointers it becomes even worse as some of them might point to
invalid addresses.

In the general case strings would live at arbitrary addresses so no
meaning could be inferred from any address.

As I said, that is a problem which will preclude some implementations
and make others inefficient.

And in general using pointers is wasting space as strings are much shorter.

In all cases

past - first

would define the length of the string.

Not really. Usually negative length is considered equivalent to zero,
e.g. when iterating substrings. Other choices may consider iteration in reverse, when bounds are inverted.

If the length was zero then it would be an empty string.

Ergo, a special case to treat in many operations.

But considering this:

declare
    S : String := "abcde";
begin
    S (1..3) := "x"; -- Illegal in Ada

In Ada would the following be legal?

Yes, in Ada slice length is constrained as the string length is.

S (1..3) := "xxx"; --replacement same size as what it is replacing

I'd be happy with that.

It is still not fully defined. You need to consider the issue of sliding bounds. E.g.

S (2..4) (2) := 'x'; -- Assign a character

Now with sliding:

S (2..4) (2) := 'x' gives "abxde", x is second in the slice

without sliding

S (2..4) (2) := 'x' gives "axcde", x is at 2 in the original string

In Ada the right side slides, the left does not. Sliding the right side
allows doing logical things like:

S1 (1..5) := S1 (5..9); -- 5..9 slides to 1..5

But should it be legal, then the result would be

   "xde"

Many implementations make this illegal because it would require either
bounded or dynamically allocated unbounded string.

You can consider make it legal for these, but then you would have
different semantics of slices for different strings. And this would
contradict the design principle of having all strings interchangeable
regardless the implementation method.

I don't mind there being differences along the lines of 'constraints'
where a less-constrained object can be passed to a callee which expects
an object with such constraints or imposes more constraints, but not one which needs fewer constraints.

No, the problem is with semantics. E.g. let in a subprogram you do

S (1..100) := ""; -- Remove the first 100 characters

S is a formal parameter. Then, depending on the actual parameter's
subtype it may succeed (for a bounded length string) or fail (for a
fixed length string). Such things are big no-no in language design,
because they become a nightmare for programmers to track.

I presume such constraints would be specified when objects are declared.

Objects and/or subtypes. Depending on the language preferences. Note
also that you can have constrained views of the same object. E.g. you
have a mutable variable passed down as in-argument. That would be an
immutable view of the same object.

Yes, and an immutable object could not be passed to a callee which
wanted a mutable object.

Yes, lifting a constraint is not possible in most cases. However,
dynamic cast is a counterexample. You can move the view up the
inheritance tree. But this is frowned upon since it enforces certain implementations.

As a programmer how would you want to specify such constraints? Would
each have a reserved word, for example?

In some cases constraints might be implied. But usually language have
lots of [sub]type modifiers like

    in, in out, out, constant
    atomic, volatile, shared
    aliased (can get pointers to)
    external, static
    public, private, protected (visibility constraints)
    range, length, bounds
    parameter AKA discriminant (general purpose constraint)
    specific type AKA static/dynamic up/downcast (view as another type) >>     class-wide (view as a class of types rooted in this one)
    ...
    measurement unit

So you wouldn't have a keyword to indicate a constraint such as
"Non-sliding lower bound" which you mentioned before but IIUC you might
have some qualification of the 'bounds' keyword as in

bounds(^..)

to indicate an unchangeable lower bound (with ^ meaning the start of the string)?

I am not sure if sliding constraint might be usable. It is a different
issue to constraining bounds because it involves operations like
assignment. And it is not clear how to implement such a constraint
effectively. Most constraints are either static (compile time), or
simple to represent, like bounds or type tags. Sliding might be
implemented as a flag, but then you will have to check it all the time.
Maybe not worth having it as a choice. And it is unclear what is the unconstrained state, sliding or non-sliding? (:-))

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Bart@21:1/5 to James Harris on Sun Oct 30 22:37:36 2022

On 30/10/2022 17:52, James Harris wrote:

On 29/10/2022 22:02, Bart wrote:

On 29/10/2022 18:42, James Harris wrote:

On 29/10/2022 16:30, Bart wrote:

Further, functions which /return/ a string would create the string
and return it whole.

Not necessarily. My dynamic language can return a string which is a
slice into another. (Slices are not exposed in this language; they are
in the static one, where slices are distinct types.)

Example:

     func trim(s) =
         if s.len=2 then return "" fi
         return s[2..$-1]
     end

This trims the first and last character of string. But here it returns
a slice into the original string. If I wanted a fresh copy, I'd have
to use copy() inside the function, or copy() (or a special kind of
assignment) outside it.

That's a challenging example. In a sense it returns either of two
different types: the caller could be handed a string or a slice.

In this language, it only has a String type, not a Slice. Slicing is an operation you apply on strings to yield another String object.

(Internally, it has to distinguish between owned strings and slices into strings owned by other objects, but as I said that aspect is not exposed.)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From luserdroog@21:1/5 to James Harris on Sun Oct 30 18:03:27 2022

On Sunday, October 30, 2022 at 1:13:44 PM UTC-5, James Harris wrote:

On 30/10/2022 16:21, luserdroog wrote:

On Monday, October 24, 2022 at 5:31:14 AM UTC-5, James Harris wrote:

Do you guys have any thoughts on the best ways for strings of characters >> to be stored?

1. There's the C way, of course, of reserving one value (zero) and using >> it as a terminator.

2. There's the 'length prefix' option of putting the length of the
string in a machine word before the characters.

3. There's the 'double pointer' way of pointing at, say, first and past
(where 'past' means first plus length such that the second pointer
points one position beyond the last character).

Any others?

..

I think an exhaustive list of options would be very large if you're not pre-judging and filtering as you're adding options.

4) [List|Array|Tuple|Iterator] of character objects

You mean where the characters are stored individually (one per node)?

Yep. Fat characters, or whatever other scaffolding helps for the operations
you want to support.

5) Use 7 bits for data, 8th bit for terminator. Either ASCII7 or UTF-7
can be used to format the data to squeeze it into 7 bits.

Interesting idea. It's certainly one I hadn't thought of.

It's used in the Classical Forth Dictionary header for the name field which
can be variable length. Often it's followed by a length byte and you store
the pointer to the length byte, using

(length - *length)

to actually get the pointer to the start.

6) Use UCS4 codes (24bit) padded out to 32 bits, and then you get a
whole byte for metadata attached to each character.

That's definitely thinking outside the box. I can see it working if the
user (the programmer) wanted a string of 24-bit values but it could be awkward in other cases such as if he wanted a string of 32-bit or 8-bit values. I don't think I mentioned it but I'd like the programmer to be
able to choose what the elements of the string would be.

This is what I was using in my unfinished APL-like language. The 8 bits
of tag meant it was easy for a node to be a character or an integer (25 bit)
or a nested subarray or whatever ... mpfp number... symbol table.

So I didn't really need a string type *per se* because there's an array type whose data elements are these 32bits of encoded whatevs. A string is
just a 1D array, or maybe an array of arrays or a 2D array padded out with spaces on each line. You'd read it in or receive it initially as a 1D array probably
from a file. Oh, an element could also be a file.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to James Harris on Mon Oct 31 17:58:09 2022

On 30/10/2022 19:13, James Harris wrote:

On 30/10/2022 16:21, luserdroog wrote:

On Monday, October 24, 2022 at 5:31:14 AM UTC-5, James Harris wrote:

Do you guys have any thoughts on the best ways for strings of characters >>> to be stored?

1. There's the C way, of course, of reserving one value (zero) and using >>> it as a terminator.

2. There's the 'length prefix' option of putting the length of the
string in a machine word before the characters.

3. There's the 'double pointer' way of pointing at, say, first and past
(where 'past' means first plus length such that the second pointer
points one position beyond the last character).

Any others?

..

I think an exhaustive list of options would be very large if you're not
pre-judging and filtering as you're adding options.

4) [List|Array|Tuple|Iterator] of character objects

You mean where the characters are stored individually (one per node)?

5) Use 7 bits for data, 8th bit for terminator. Either ASCII7 or UTF-7
can be used to format the data to squeeze it into 7 bits.

Interesting idea. It's certainly one I hadn't thought of.

Nor should you - that is a crazy idea. It is massively inefficient, as
well as being inconsistent with everything else.

6) Use UCS4 codes (24bit) padded out to 32 bits, and then you get a
whole byte for metadata attached to each character.

That's definitely thinking outside the box. I can see it working if the
user (the programmer) wanted a string of 24-bit values but it could be awkward in other cases such as if he wanted a string of 32-bit or 8-bit values. I don't think I mentioned it but I'd like the programmer to be
able to choose what the elements of the string would be.

UCS4 is 31 bit, not 24 bit. Perhaps luserdroog is mixing it up with
UTF-32, which can be covered by 21 bits. (The extra 10 bits in UCS4
have never been anything but 0, but if you want to refer to a long
out-dated and obsolete standard, it should still be done so accurately.)

Do not make a new string or character storage system based around
anything obsolete - that includes UTF-7 and UCS4. Making lots of extra
work for yourself to support something that everyone else rejected as complicated, unnecessary and unused decades ago, is just silly.

There are only two character encodings that make any kind of sense in a
modern language (i.e., anything from this century). UTF-8 and
/possibly/ UTF-32 for internal usage, if you find it more convenient for
the operations you want.

If you are using UTF-32 only for internal usage within the language, and
not for export (external function calls, file IO, etc.), then the high
byte is always unused - and therefore free for metadata if that's what
you want. I'm not convinced it is a good idea, but it's certainly possible.

For any kind of interaction with anything else, UTF-8 is the standard.
There are a few other encodings that haven't died off completely yet,
but they are all on the way out.

I would also recommend treating characters and character strings as
something very different from raw bytes and binary blobs. Users want to
do very different things with them, and many of the useful operations
are completely different. Some languages have made the mistake of
conflating the two concepts - it's difficult to fix once that design
flaw is set into a language.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Bart@21:1/5 to David Brown on Mon Oct 31 17:31:28 2022

On 31/10/2022 16:58, David Brown wrote:

On 30/10/2022 19:13, James Harris wrote:

On 30/10/2022 16:21, luserdroog wrote:

On Monday, October 24, 2022 at 5:31:14 AM UTC-5, James Harris wrote:

Do you guys have any thoughts on the best ways for strings of
characters
to be stored?

1. There's the C way, of course, of reserving one value (zero) and
using
it as a terminator.

2. There's the 'length prefix' option of putting the length of the
string in a machine word before the characters.

3. There's the 'double pointer' way of pointing at, say, first and past >>>> (where 'past' means first plus length such that the second pointer
points one position beyond the last character).

Any others?

..

I think an exhaustive list of options would be very large if you're not
pre-judging and filtering as you're adding options.

4) [List|Array|Tuple|Iterator] of character objects

You mean where the characters are stored individually (one per node)?

5) Use 7 bits for data, 8th bit for terminator. Either ASCII7 or UTF-7
can be used to format the data to squeeze it into 7 bits.

Interesting idea. It's certainly one I hadn't thought of.

Nor should you - that is a crazy idea. It is massively inefficient, as
well as being inconsistent with everything else.

It's a perfectly fine idea - for the 1970s.

(Now the 8th bit is better put to use to represent UTF8.)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Bart on Mon Oct 31 20:34:17 2022

On 31/10/2022 18:31, Bart wrote:

On 31/10/2022 16:58, David Brown wrote:

On 30/10/2022 19:13, James Harris wrote:

On 30/10/2022 16:21, luserdroog wrote:

On Monday, October 24, 2022 at 5:31:14 AM UTC-5, James Harris wrote:

Do you guys have any thoughts on the best ways for strings of
characters
to be stored?

1. There's the C way, of course, of reserving one value (zero) and
using
it as a terminator.

2. There's the 'length prefix' option of putting the length of the
string in a machine word before the characters.

3. There's the 'double pointer' way of pointing at, say, first and
past
(where 'past' means first plus length such that the second pointer
points one position beyond the last character).

Any others?

..

I think an exhaustive list of options would be very large if you're not >>>> pre-judging and filtering as you're adding options.

4) [List|Array|Tuple|Iterator] of character objects

You mean where the characters are stored individually (one per node)?

5) Use 7 bits for data, 8th bit for terminator. Either ASCII7 or UTF-7 >>>> can be used to format the data to squeeze it into 7 bits.

Interesting idea. It's certainly one I hadn't thought of.

Nor should you - that is a crazy idea. It is massively inefficient,
as well as being inconsistent with everything else.

It's a perfectly fine idea - for the 1970s.

(Now the 8th bit is better put to use to represent UTF8.)

Indeed.

UTF-7 was invented in attempt to make an encoding for Unicode that would
work for email, since some email servers did not handle 8-bit characters correctly, long ago in the dark ages. It was never formally a Unicode encoding, and almost never used in practice.

Using 7-bit characters and the eighth bit for a terminator would be
pretty inefficient - 12.5% wasted space to get a single bit of useful information per string. Not a good trade-off!

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From James Harris@21:1/5 to Bart on Thu Nov 3 16:10:38 2022

On 30/10/2022 22:37, Bart wrote:

On 30/10/2022 17:52, James Harris wrote:

On 29/10/2022 22:02, Bart wrote:

On 29/10/2022 18:42, James Harris wrote:

On 29/10/2022 16:30, Bart wrote:

Further, functions which /return/ a string would create the string
and return it whole.

Not necessarily. My dynamic language can return a string which is a
slice into another. (Slices are not exposed in this language; they
are in the static one, where slices are distinct types.)

Example:

     func trim(s) =
         if s.len=2 then return "" fi
         return s[2..$-1]
     end

This trims the first and last character of string. But here it
returns a slice into the original string. If I wanted a fresh copy,
I'd have to use copy() inside the function, or copy() (or a special
kind of assignment) outside it.

That's a challenging example. In a sense it returns either of two
different types: the caller could be handed a string or a slice.

In this language, it only has a String type, not a Slice. Slicing is an operation you apply on strings to yield another String object.

(Internally, it has to distinguish between owned strings and slices into strings owned by other objects, but as I said that aspect is not exposed.)

As most uses of slices would be read-only but some would be read-write,
and there are various potential ways to implement a slice, it might be
sensible for me to do something like:

1. Have the default slice as read-only and the simplest to construct. If
s is a string or another slice then a slice of part of that string could
be constructed as

s(1..4)

2. Use keywords to effect other, more specialised, operations. For example,

s.slice_cow(1..4) ;a copy-on-write slice
s.slice_rw(1..4) ;a slice via which the base string can be changed
s.copy(1..4) ;copy the designated chars to a new string

--
James Harris

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From James Harris@21:1/5 to David Brown on Thu Nov 3 15:48:20 2022

On 31/10/2022 16:58, David Brown wrote:

On 30/10/2022 19:13, James Harris wrote:

On 30/10/2022 16:21, luserdroog wrote:

On Monday, October 24, 2022 at 5:31:14 AM UTC-5, James Harris wrote:

Do you guys have any thoughts on the best ways for strings of
characters
to be stored?

..

5) Use 7 bits for data, 8th bit for terminator. Either ASCII7 or UTF-7
can be used to format the data to squeeze it into 7 bits.

Interesting idea. It's certainly one I hadn't thought of.

Nor should you - that is a crazy idea. It is massively inefficient, as
well as being inconsistent with everything else.

The model I have chosen (at least, for now) is to have a string indexed logically from zero (so indices do not need to be stored) and, for implementation, delimited by two pointers.

The one downside I am aware of is that it will, at times, require
creation and destruction of a small descriptor. I'll have to see how the approach works out in practice.

..

I would also recommend treating characters and character strings as
something very different from raw bytes and binary blobs. Users want to
do very different things with them, and many of the useful operations
are completely different. Some languages have made the mistake of conflating the two concepts - it's difficult to fix once that design
flaw is set into a language.

That sounds interesting but I cannot tell what you have in mind.

One could consider strings as having two categories of operation: those
which involve only the memory used by strings such as allocation, concatenation, insertion, deletion, etc; and those which care about the contents of a string such as capitalisation, comparison, whitespace recognition, parsing, etc. Why could the mechanics not apply to raw
bytes and blobs?

--
James Harris

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From James Harris@21:1/5 to Dmitry A. Kazakov on Thu Nov 3 15:57:09 2022

On 30/10/2022 18:28, Dmitry A. Kazakov wrote:

On 2022-10-30 18:46, James Harris wrote:

..

In Ada would the following be legal?

Yes, in Ada slice length is constrained as the string length is.

S (1..3) := "xxx"; --replacement same size as what it is replacing

I'd be happy with that.

It is still not fully defined. You need to consider the issue of sliding bounds. E.g.

S (2..4) (2) := 'x'; -- Assign a character

Now with sliding:

S (2..4) (2) := 'x' gives "abxde", x is second in the slice

without sliding

S (2..4) (2) := 'x' gives "axcde", x is at 2 in the original string

AISI the elements of strings and slices would always be accessed by
offset. That appears to be the 'sliding' model you mention.

In Ada the right side slides, the left does not. Sliding the right side allows doing logical things like:

S1 (1..5) := S1 (5..9); -- 5..9 slides to 1..5

I don't see any sliding there, only that chars 5 to 9 are copied over
chars 1 to 5 of the same string.

..

I am not sure if sliding constraint might be usable. It is a different
issue to constraining bounds because it involves operations like
assignment. And it is not clear how to implement such a constraint effectively. Most constraints are either static (compile time), or
simple to represent, like bounds or type tags. Sliding might be
implemented as a flag, but then you will have to check it all the time.
Maybe not worth having it as a choice. And it is unclear what is the unconstrained state, sliding or non-sliding? (:-))

Always sliding (if I understand the term)!

--
James Harris

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to James Harris on Fri Nov 4 14:58:42 2022

On 03/11/2022 16:48, James Harris wrote:

On 31/10/2022 16:58, David Brown wrote:

On 30/10/2022 19:13, James Harris wrote:

On 30/10/2022 16:21, luserdroog wrote:

On Monday, October 24, 2022 at 5:31:14 AM UTC-5, James Harris wrote:

Do you guys have any thoughts on the best ways for strings of
characters
to be stored?

..

5) Use 7 bits for data, 8th bit for terminator. Either ASCII7 or UTF-7 >>>> can be used to format the data to squeeze it into 7 bits.

Interesting idea. It's certainly one I hadn't thought of.

Nor should you - that is a crazy idea. It is massively inefficient,
as well as being inconsistent with everything else.

The model I have chosen (at least, for now) is to have a string indexed logically from zero (so indices do not need to be stored) and, for implementation, delimited by two pointers.

The one downside I am aware of is that it will, at times, require
creation and destruction of a small descriptor. I'll have to see how the approach works out in practice.

There is no universally ideal way to store a string - every method is inefficient for some uses and operations. You just have to find
something that is good enough for typical uses in your language, add
support for connecting to external code (such as exporting C-style
strings), and make it possible for users to implement other types when
it makes more sense for the user code.

..

I would also recommend treating characters and character strings as
something very different from raw bytes and binary blobs. Users want
to do very different things with them, and many of the useful
operations are completely different. Some languages have made the
mistake of conflating the two concepts - it's difficult to fix once
that design flaw is set into a language.

That sounds interesting but I cannot tell what you have in mind.

I mean you should consider a "string" to be a way of holding a sequence
of "character" units which can hold a code unit of UTF-8 (since any
other choice of character encoding is madness).

This should be different from a "byte", which would be an 8-bit unit of
raw memory (ignore the existence of machines that can't address 8-bit
memory units directly - they will never use your language). Arrays of
this type should always be contiguous, and used for raw memory access.
(You may also want larger types - memory16, memory32, memory64, etc., if
that is convenient for efficient usage.)

So when you read a block of data from a file, or send a block to a
network socket, it is an array of bytes - not a string. (You can have high-level abstractions for a "text file" wrapper that can read and
write strings, but that's not fundamental.) If you have an equivalent
of C's "memcpy" function it should use bytes, not any kind of character
type. If you have something like C's "type-based aliasing rules", then
it is bytes that should have the special exception, not characters.

Neither "byte" nor "character" should have any kind of arithmetic
operators - they are not integers. But you will need cast or conversion operations on them.

The concept of "signed char" and "unsigned char" in C is a serious
design flaw. A type designed to hold letters should not have a sign,
and should not be used to hold arbitrary raw, low-level data.

You might also consider not having a character type at all. Python 3
has no character types - "a" is a string, not a character.

One could consider strings as having two categories of operation: those
which involve only the memory used by strings such as allocation, concatenation, insertion, deletion, etc; and those which care about the contents of a string such as capitalisation, comparison, whitespace recognition, parsing, etc. Why could the mechanics not apply to raw
bytes and blobs?

Operations on strings are those that are relevant to strings. It makes
no sense to capitalise a raw binary blob. It makes no sense to have
methods for linking or chaining sets of blobs - these are direct handles
into memory, and chaining, allocation, etc., are higher level operations.

Raw binary buffers require nothing more than an address and a size to
describe them - anything more, and it is too high level. (Again,
there's nothing wrong with providing higher level features and
interfaces, but they have to build on the fundamental ones.)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Bart@21:1/5 to David Brown on Fri Nov 4 17:50:59 2022

On 04/11/2022 13:58, David Brown wrote:

Neither "byte" nor "character" should have any kind of arithmetic
operators - they are not integers. But you will need cast or conversion operations on them.

Bytes are small integers, typically of u8 type.

I can't see why arithmetic can't be done with them, unless you want a
purer kind of language where arithmetic is only allowed on signed
numbers, and bitwise ops only on unsigned numbers, which is usually
going to be a pain for all concerned.

The concept of "signed char" and "unsigned char" in C is a serious
design flaw. A type designed to hold letters should not have a sign,
and should not be used to hold arbitrary raw, low-level data.

Signed and unsigned chars are not so bad; presumably C intended these to
do the job of a 'byte' type for small integers. So it was just a poor
choice of name. (After all there is no separate type in C for bytes
holding character data.)

What's bad is that third kind: a 'plain char' type, which is
incompatible with both signed and unsigned char, even though it
necessarily needs to be one of the other on a specific platform. It
occurs in no other language, and causes problems within FFI APIs.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From James Harris@21:1/5 to Bart on Fri Nov 4 21:35:46 2022

On 04/11/2022 17:50, Bart wrote:

On 04/11/2022 13:58, David Brown wrote:

Neither "byte" nor "character" should have any kind of arithmetic
operators - they are not integers. But you will need cast or
conversion operations on them.

Bytes are small integers, typically of u8 type.

I can't see why arithmetic can't be done with them, unless you want a
purer kind of language where arithmetic is only allowed on signed
numbers, and bitwise ops only on unsigned numbers, which is usually
going to be a pain for all concerned.

I think what David means is that arithmetic operations don't apply to characters (even though some languages permit such operations). For
example, neither

'a' * 5

nor even

'R' + 1

have any meaning over the set of characters. Prohibiting arithmetic on
them could be dome but would make classifying and manipulating
characters difficult unless one had a comprehensive set of library
functions such as

is_digit(char)
is_alphanum(locale, char)
is_lower(locale, char)
upper(locale, char)

and many more.

--
James Harris

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Bart@21:1/5 to James Harris on Fri Nov 4 22:28:00 2022

On 04/11/2022 21:35, James Harris wrote:

On 04/11/2022 17:50, Bart wrote:

On 04/11/2022 13:58, David Brown wrote:

Neither "byte" nor "character" should have any kind of arithmetic
operators - they are not integers. But you will need cast or
conversion operations on them.

Bytes are small integers, typically of u8 type.

I can't see why arithmetic can't be done with them, unless you want a
purer kind of language where arithmetic is only allowed on signed
numbers, and bitwise ops only on unsigned numbers, which is usually
going to be a pain for all concerned.

I think what David means is that arithmetic operations don't apply to characters

I was picking on the 'byte' type; it seems extraordinary that you
shouldn't be allowed to do arithmetic with them. If you can initialise a
byte value with a number like this:

byte a = 123

then it's a number!

(even though some languages permit such operations). For
example, neither

'a' * 5

nor even

'R' + 1

have any meaning over the set of characters.

I actually had such a restriction for a while: char*5 wasn't allowed,
but char+1 was. After all why on earth shouldn't you want the next
character in that alphabet? Why should code like this be made illegal:

a := a * 10 + (c - '0')

Then I realised I shouldn't be telling the programmer what they can and
can't do with characters, as there might be some perfectly valid
use-case that I simply hadn't thought of.

Maybe 'a' * 5 yields the value 'aaaaa' or the string "aaaaa", or this is
some kind on encryption algorithm.

So now they are treated like integers, other than printing an array of
char or pointer to char assumes they are strings.

Prohibiting arithmetic on
them could be dome but would make classifying and manipulating
characters difficult unless one had a comprehensive set of library
functions such as

is_digit(char)
is_alphanum(locale, char)
is_lower(locale, char)
upper(locale, char)

and many more.

As I said, you and I don't know all the possibilites. Of course there
would need to be conversions between char and int, but this can become a nuisance.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From James Harris@21:1/5 to David Brown on Sat Nov 5 11:14:01 2022

On 04/11/2022 13:58, David Brown wrote:

On 03/11/2022 16:48, James Harris wrote:

On 31/10/2022 16:58, David Brown wrote:

..

I would also recommend treating characters and character strings as
something very different from raw bytes and binary blobs. Users want
to do very different things with them, and many of the useful
operations are completely different. Some languages have made the
mistake of conflating the two concepts - it's difficult to fix once
that design flaw is set into a language.

That sounds interesting but I cannot tell what you have in mind.

I mean you should consider a "string" to be a way of holding a sequence
of "character" units which can hold a code unit of UTF-8 (since any
other choice of character encoding is madness).

We've discussed before that (IMO) Unicode is useful for physical
printing to paper or electronic rendering such as to PDF but that it's a nightmare for programmers and users when it is used for any kind of
input so I won't go over that again except to say that AISI Unicode
should be handled by library functions rather than a language.

What I do have in mind is strings of 'containers' where a string might
be declared as of type

string of char8 -- meaning a string of char8 containers
string of char32 -- meaning a string of char32 containers

What goes in each 8-bit or 32-bit 'container' would be another matter.

That agnostic ideal is somewhat in tension with the desire to include
string literals in a program text. For that, as I've mentioned before,
my preference is to have the program text and any literals within it
written in ASCII and American English; supplementary files would express
the string literals in other languages.

For example,

print "Hello world"

would be accompanied by a file for French which included

"Hello world" --> "Bonjour le monde"

Naturally, multilingual programming is much more complex than that
simple example but it shows the basic idea. The compiler would be able
to check that language files had everything required for a given piece
of source code.

..

Neither "byte" nor "character" should have any kind of arithmetic
operators - they are not integers. But you will need cast or conversion operations on them.

The char8 and char32 containers could omit support for arithmetic **if**
enough support routines were provided. But, as Bart says, it's difficult
to anticipate all such support routines that programmers might need.

The concept of "signed char" and "unsigned char" in C is a serious
design flaw. A type designed to hold letters should not have a sign,
and should not be used to hold arbitrary raw, low-level data.

OK.

You might also consider not having a character type at all. Python 3
has no character types - "a" is a string, not a character.

I can see that as being possible. I keep coming across examples of where
a 1-character string would do just as well as a character. ATM, though,
I have a separate character type.

..

Raw binary buffers require nothing more than an address and a size to describe them - anything more, and it is too high level. (Again,
there's nothing wrong with providing higher level features and
interfaces, but they have to build on the fundamental ones.)

Maybe that's where I am going with this. What you describe does sound
rather like my idea of using chars as containers and putting char and
string handling in libraries.

Incidentally, AISI I could have strings of any data type, not just
chars. For example,

string of int16
string of struct {x: float, y: float}
string of function(int, bool) -> uint

I'll have to see how it works out in practice but the idea is to
separate the concept of a string (basically, storage layout) from the
concept of whatever type of element the string is made from.

--
James Harris

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From James Harris@21:1/5 to Bart on Sat Nov 5 10:36:33 2022

On 04/11/2022 22:28, Bart wrote:

On 04/11/2022 21:35, James Harris wrote:

On 04/11/2022 17:50, Bart wrote:

On 04/11/2022 13:58, David Brown wrote:

Neither "byte" nor "character" should have any kind of arithmetic
operators - they are not integers. But you will need cast or
conversion operations on them.

Bytes are small integers, typically of u8 type.

I can't see why arithmetic can't be done with them, unless you want a
purer kind of language where arithmetic is only allowed on signed
numbers, and bitwise ops only on unsigned numbers, which is usually
going to be a pain for all concerned.

I think what David means is that arithmetic operations don't apply to
characters

I was picking on the 'byte' type; it seems extraordinary that you
shouldn't be allowed to do arithmetic with them. If you can initialise a
byte value with a number like this:

    byte a = 123

then it's a number!

(even though some languages permit such operations). For example, neither

   'a' * 5

nor even

   'R' + 1

have any meaning over the set of characters.

I actually had such a restriction for a while: char*5 wasn't allowed,
but char+1 was. After all why on earth shouldn't you want the next
character in that alphabet?

That's because 'R' + 1 may not be the next character in all alphabets.
Defining 'next' is more than difficult. It depends on intended collation
order which varies in different parts of the world and can even change
over time as authorities choose different collation orders. Some
plausible meanings of 'R' + 1:

'S' (as in ASCII)
'r' (what the user may want as sort order)
's' (what the user may want as sort order)
a non-character (as in EBCDIC)

Perhaps a pseudo-call would be better such as

char_plus(collation, 'R', 1)

where 'collation' would be used to determine what was the specified
number of characters away from 'R'.

Why should code like this be made illegal:

    a := a * 10 + (c - '0')

I'm not saying it should. I am in two minds about what's best. The
alternative is something like

a := a * 10 + digit_value(c)

Then I realised I shouldn't be telling the programmer what they can and
can't do with characters, as there might be some perfectly valid
use-case that I simply hadn't thought of.

Agreed. Although the point of prohibiting arithmetic on characters is to
make multilingual programming easier, not harder.

..

Prohibiting arithmetic on them could be dome but would make
classifying and manipulating characters difficult unless one had a
comprehensive set of library functions such as

   is_digit(char)
   is_alphanum(locale, char)
   is_lower(locale, char)
   upper(locale, char)

and many more.

As I said, you and I don't know all the possibilites.

Yes, the challenge of making multilingual programming easier is
providing a comprehensive and convenient set of conversions.

Of course there
would need to be conversions between char and int, but this can become a nuisance.

Aside from converting digits (and any other characters used as digits in
a higher number base) is there's any meaning to converting chars to/from
ints?

--
James Harris

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Bart@21:1/5 to James Harris on Sat Nov 5 13:46:33 2022

On 05/11/2022 11:14, James Harris wrote:

On 04/11/2022 13:58, David Brown wrote:

I mean you should consider a "string" to be a way of holding a
sequence of "character" units which can hold a code unit of UTF-8
(since any other choice of character encoding is madness).

We've discussed before that (IMO) Unicode is useful for physical
printing to paper or electronic rendering such as to PDF but that it's a nightmare for programmers and users when it is used for any kind of
input so I won't go over that again except to say that AISI Unicode
should be handled by library functions rather than a language.

What I do have in mind is strings of 'containers' where a string might
be declared as of type

string of char8 -- meaning a string of char8 containers
string of char32 -- meaning a string of char32 containers

What goes in each 8-bit or 32-bit 'container' would be another matter.

That agnostic ideal is somewhat in tension with the desire to include
string literals in a program text. For that, as I've mentioned before,
my preference is to have the program text and any literals within it
written in ASCII and American English; supplementary files would express
the string literals in other languages.

For example,

print "Hello world"

would be accompanied by a file for French which included

"Hello world" --> "Bonjour le monde"

Naturally, multilingual programming is much more complex than that
simple example but it shows the basic idea. The compiler would be able
to check that language files had everything required for a given piece
of source code.

Is it? This pretty much all I did when I used to write internationalised applications. Although that was only done for French, German and Dutch.

But that print example would be written like this:

print /"Hello World"

The "/" was a translation operator, so only certain strings were
translated. This also made it easy to scan source code to build a list
of messages, used to maintain the dictionary as entries were added,
deleted or modified.

The scheme did need some hints sometimes, written like this, to get
around ambiguities:

print /"Green!colour"
print /"Green!fresh"

The hint was usually filtered out.

But this is little to do with how strings are represented. Even in
English, messages may include characters like "£" (pound sign) which is
not part of ASCII.

So a way to represent Unicode within literals is still needed (didn't we discuss this a couple of years ago?).

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Bart@21:1/5 to James Harris on Sat Nov 5 13:38:16 2022

On 05/11/2022 10:36, James Harris wrote:

On 04/11/2022 22:28, Bart wrote:

I actually had such a restriction for a while: char*5 wasn't allowed,
but char+1 was. After all why on earth shouldn't you want the next
character in that alphabet?

That's because 'R' + 1 may not be the next character in all alphabets. Defining 'next' is more than difficult. It depends on intended collation order which varies in different parts of the world and can even change
over time as authorities choose different collation orders. Some
plausible meanings of 'R' + 1:

'S' (as in ASCII)

You said elsewhere that you want to use ASCII within programs. Which is
it happens, corresponds to the first 128 points of Unicode. Here:

char c
for c in 'A'..'Z' do
print c
od

this displays 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'. The for-loop works by adding
+1 to 'c'; it doesn't care about collating order!

(This also illustrates the difference between `byte` and `char`; using a
byte type, the output would be '656667...90'.)

'r' (what the user may want as sort order)
's' (what the user may want as sort order)
a non-character (as in EBCDIC)

My feeling is that it is these diverse requirements that require
user-supplied functions.

My 'char' is still a thinly veiled numeric type, so ordinary integer
arithmatic can be used. Otherwise even something like this becomes
impossible:

['A'..'Z']int histogram

midpoint := (histogram.upb - histogram.lwb)/2

++histogram[midpoint+1]

This requires certain properties of array indicates, like being able to
do arithmetic, as well as being consecutive ordinal values.

Perhaps a pseudo-call would be better such as

char_plus(collation, 'R', 1)

where 'collation' would be used to determine what was the specified
number of characters away from 'R'.

Sure, as I said, you can provide any interpretation you like. But if you
do C+1, you expect to get the code of the next character (or next
codepoint if venturing outside ASCII).

Aside from converting digits (and any other characters used as digits in
a higher number base) is there's any meaning to converting chars to/from ints?

My static language makes byte and char slightly different types. (Types involving char may get printed differently.)

That meant that `ref byte` and `ref char` were incompatible, which
rapidly turned into a nightmare: I might have a readfile() routine that returned a `ref byte` type, a pointer to a block of memory.

But I wanted to interpret that block as `ref char` - a string. So this
meant loads of casts to either `ref byte` or `ref char` to get things to
work, but it got too much (a bit like 'const poisoning' in C where it
just propagates everywhere). That was clearly the wrong approach.

In the end I relaxed the type rules so that `ref byte` and `ref char`
are compatible, and everything is now SO much simpler.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From James Harris@21:1/5 to Bart on Sat Nov 5 16:07:30 2022

On 05/11/2022 13:38, Bart wrote:

On 05/11/2022 10:36, James Harris wrote:

On 04/11/2022 22:28, Bart wrote:

I actually had such a restriction for a while: char*5 wasn't allowed,
but char+1 was. After all why on earth shouldn't you want the next
character in that alphabet?

That's because 'R' + 1 may not be the next character in all alphabets.
Defining 'next' is more than difficult. It depends on intended
collation order which varies in different parts of the world and can
even change over time as authorities choose different collation
orders. Some plausible meanings of 'R' + 1:

   'S' (as in ASCII)

You said elsewhere that you want to use ASCII within programs. Which is
it happens, corresponds to the first 128 points of Unicode. Here:

    char c
    for c in 'A'..'Z' do
        print c
    od

this displays 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'. The for-loop works by adding
+1 to 'c'; it doesn't care about collating order!

That piece of code is fine for an English-speaking user but to a
Spaniard the alphabet has a character missing, and Greeks wouldn't agree
with it at all.

Where L is the locale why not allow something like

for c in L.alpha_first..L.alpha_last
print c
od

?

That should work in English or Spanish or Greek etc, shouldn't it?

(This also illustrates the difference between `byte` and `char`; using a
byte type, the output would be '656667...90'.)

   'r' (what the user may want as sort order)
   's' (what the user may want as sort order)
   a non-character (as in EBCDIC)

My feeling is that it is these diverse requirements that require user-supplied functions.

Functions, yes, (or what appear to be functions) though surely they
should be part of a library that comes with the language.

My 'char' is still a thinly veiled numeric type, so ordinary integer arithmatic can be used. Otherwise even something like this becomes impossible:

    ['A'..'Z']int histogram

    midpoint := (histogram.upb - histogram.lwb)/2

    ++histogram[midpoint+1]

This requires certain properties of array indicates, like being able to
do arithmetic, as well as being consecutive ordinal values.

Why not

[L.alpha_first..L.alpha_last] int histogram

?

As for the calculations what about using L.ord and L.chr to convert
between chars and integers?

Perhaps a pseudo-call would be better such as

   char_plus(collation, 'R', 1)

where 'collation' would be used to determine what was the specified
number of characters away from 'R'.

Sure, as I said, you can provide any interpretation you like. But if you
do C+1, you expect to get the code of the next character (or next
codepoint if venturing outside ASCII).

If you use codepoints then you might not get the next character in
sequence - as in the case of 'R' in ebcdic (you'd get a non-printing
character) or 'N' in Spanish (you'd get 'O' rather than the N with a hat
that a Spaniard would expect).

If the programmer wants "the next character in the alphabet" then
shouldn't the programming language or a standard library help him get
that irrespective of the human language the program is meant to be
processing?

Aside from converting digits (and any other characters used as digits
in a higher number base) is there's any meaning to converting chars
to/from ints?

My static language makes byte and char slightly different types. (Types involving char may get printed differently.)

That meant that `ref byte` and `ref char` were incompatible, which
rapidly turned into a nightmare: I might have a readfile() routine that returned a `ref byte` type, a pointer to a block of memory.

But I wanted to interpret that block as `ref char` - a string. So this
meant loads of casts to either `ref byte` or `ref char` to get things to work, but it got too much (a bit like 'const poisoning' in C where it
just propagates everywhere). That was clearly the wrong approach.

C's const propagation sounds like Java with its horrible, and sticky,
exception propagation.

In the end I relaxed the type rules so that `ref byte` and `ref char`
are compatible, and everything is now SO much simpler.

Would there have been any value in defining a layout for the untyped
area of bytes (or parts thereof)? That's where I think I am headed.

--
James Harris

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to James Harris on Sat Nov 5 17:51:01 2022

On 04/11/2022 22:35, James Harris wrote:

On 04/11/2022 17:50, Bart wrote:

On 04/11/2022 13:58, David Brown wrote:

Neither "byte" nor "character" should have any kind of arithmetic
operators - they are not integers. But you will need cast or
conversion operations on them.

Bytes are small integers, typically of u8 type.

I can't see why arithmetic can't be done with them, unless you want a
purer kind of language where arithmetic is only allowed on signed
numbers, and bitwise ops only on unsigned numbers, which is usually
going to be a pain for all concerned.

I think what David means is that arithmetic operations don't apply to characters (even though some languages permit such operations). For
example, neither

'a' * 5

nor even

'R' + 1

have any meaning over the set of characters. Prohibiting arithmetic on
them could be dome but would make classifying and manipulating
characters difficult unless one had a comprehensive set of library
functions such as

is_digit(char)
is_alphanum(locale, char)
is_lower(locale, char)
upper(locale, char)

and many more.

Yes.

You will, of course, need some kind of explicit conversion between strings/characters and integers or bytes. But if you want operations
like "is_digit" or "upper" for more than plain ASCII, you need a
comprehensive library. (As a fine example, the uppercase of "i" in
English is the letter "I", while in Turkish it is the letter "İ".)

There are plenty of internationalisation libraries available - you
should be able to find something suitable (perhaps
<https://icu.unicode.org>) and make wrappers and interfaces to your new language.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Bart on Sat Nov 5 17:41:59 2022

On 04/11/2022 18:50, Bart wrote:

On 04/11/2022 13:58, David Brown wrote:

Neither "byte" nor "character" should have any kind of arithmetic
operators - they are not integers. But you will need cast or
conversion operations on them.

Bytes are small integers, typically of u8 type.

That's true in some languages. In other languages they are character
types. (In C, they are both.)

My suggestion is that they should not be considered integers or
characters, but a low-level "raw data" type. A "byte" should represent
a byte of memory, or an octet in a network packet, or a byte of a file.
It doesn't make sense to do any arithmetic on it because it might be
part of some completely different data type.

A given "byte" might be part of the storage of a floating point number,
or the storage of an address or pointer, or anything else.

I've said this before - the strength of a programming language is mainly determined by what you /cannot/ do, rather than what you /can/ do.
Design a language to make it as hard as possible to get things wrong,
while still being easy to do the things you want it to do. Keeping
integer types, character types, strings, and raw data independent can
help with that.

I can't see why arithmetic can't be done with them, unless you want a
purer kind of language where arithmetic is only allowed on signed
numbers, and bitwise ops only on unsigned numbers, which is usually
going to be a pain for all concerned.

Whether you have signed and unsigned integer types, what sizes you have, whether you have an abstract "integer" type and explicitly ranged
subtypes (as in Ada) is another matter.

The concept of "signed char" and "unsigned char" in C is a serious
design flaw. A type designed to hold letters should not have a sign,
and should not be used to hold arbitrary raw, low-level data.

Signed and unsigned chars are not so bad;

Yes, they are bad. A "character" is a letter or other visual symbol.
Can you explain the difference between "a positive letter X" and "a
negative letter X" ? Of course you can't - it is utter nonsense. The
same goes for adding 6 to the Greek letter µ, or multiplying Ð by Å.
Even operations that might appear sensible in code, such as adding 1 to
a char, don't actually make sense - it is the operation of taking the
next letter in the alphabet that makes sense.

presumably C intended these to
do the job of a 'byte' type for small integers. So it was just a poor
choice of name. (After all there is no separate type in C for bytes
holding character data.)

What's bad is that third kind: a 'plain char' type, which is
incompatible with both signed and unsigned char, even though it
necessarily needs to be one of the other on a specific platform. It
occurs in no other language, and causes problems within FFI APIs.

Certainly having three different "char" types in C is bad. But the only sensible choice is to have nothing but a plain "char" and use /integer/
types for numeric data. (Call them "u8" and "i8" if you prefer, rather
than the C names "uint8_t" and "int8_t".)

(These days I would consider not having any kind of character type at
all, unless it was a language targeting small embedded systems that need maximal efficiency - by the time you have something that can hold any
UTF-8 character, you might as well just call it a "string".)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From James Harris@21:1/5 to Bart on Sat Nov 5 16:18:49 2022

On 05/11/2022 13:46, Bart wrote:

On 05/11/2022 11:14, James Harris wrote:

..

For example,

   print "Hello world"

would be accompanied by a file for French which included

   "Hello world" --> "Bonjour le monde"

Naturally, multilingual programming is much more complex than that
simple example but it shows the basic idea. The compiler would be able
to check that language files had everything required for a given piece
of source code.

Is it? This pretty much all I did when I used to write internationalised applications. Although that was only done for French, German and Dutch.

I imagine multilingual programming would be very difficult and that it's something a language should help with but it's not something I have had
to do as yet.

But that print example would be written like this:

    print /"Hello World"

The "/" was a translation operator, so only certain strings were
translated. This also made it easy to scan source code to build a list
of messages, used to maintain the dictionary as entries were added,
deleted or modified.

The scheme did need some hints sometimes, written like this, to get
around ambiguities:

    print /"Green!colour"
    print /"Green!fresh"

That's cool. I have something similar which is trailing identifiers but
I had them down as specific rather than hints. For example,

print "Green" :GreenColor
print "Green" :GreenFresh

Then the accompanying language files could distinguish between the strings.

The hint was usually filtered out.

But this is little to do with how strings are represented. Even in
English, messages may include characters like "£" (pound sign) which is
not part of ASCII.

I have two ways to deal with that. If the program is to use a pound sign
in England but a Dollar sign in America, say, then the source would have
a dollar sign and there'd be a file to translate for use in England.

On the other hand, if the program was to use a pound sign in all cases
then I'd do as below.

So a way to represent Unicode within literals is still needed (didn't we discuss this a couple of years ago?).

Yes. You may remember my preference was for a named character something like

pound_currency_string = "\PoundSterling/"

--
James Harris

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Bart on Sat Nov 5 18:01:17 2022

On 04/11/2022 23:28, Bart wrote:

On 04/11/2022 21:35, James Harris wrote:

On 04/11/2022 17:50, Bart wrote:

On 04/11/2022 13:58, David Brown wrote:

Neither "byte" nor "character" should have any kind of arithmetic
operators - they are not integers. But you will need cast or
conversion operations on them.

Bytes are small integers, typically of u8 type.

I can't see why arithmetic can't be done with them, unless you want a
purer kind of language where arithmetic is only allowed on signed
numbers, and bitwise ops only on unsigned numbers, which is usually
going to be a pain for all concerned.

I think what David means is that arithmetic operations don't apply to
characters

I was picking on the 'byte' type; it seems extraordinary that you
shouldn't be allowed to do arithmetic with them. If you can initialise a
byte value with a number like this:

    byte a = 123

then it's a number!

If I were making the language, then you could not do such an
initialisation. A "byte" is raw data, not a number.

(even though some languages permit such operations). For example, neither

   'a' * 5

nor even

   'R' + 1

have any meaning over the set of characters.

I actually had such a restriction for a while: char*5 wasn't allowed,
but char+1 was. After all why on earth shouldn't you want the next
character in that alphabet? Why should code like this be made illegal:

    a := a * 10 + (c - '0')

Why not :

char a; // Use whatever syntax you prefer
int i; // and whatever type names you prefer

a = digit(i);

The function "digit" might be defined :

char digit(int i) {
return char(i + ord('0'));
}

You want to find the next letter after "x"? "char(ord(x) + 1)". Or
perhaps, like Pascal and Ada, "succ(x)".

A language has to let you do what you need to do - but you should be
required to write it /clearly/. It should not let you mix apples and
oranges without you saying exactly how you think apples and oranges
should be mixed in the given situation.

Then I realised I shouldn't be telling the programmer what they can and
can't do with characters, as there might be some perfectly valid
use-case that I simply hadn't thought of.

Maybe 'a' * 5 yields the value 'aaaaa' or the string "aaaaa", or this is
some kind on encryption algorithm.

You've hit the nail on the head. What does it mean to write "'a' * 5" ?
To some people it means one thing, to others it means something
different, and to most people in most circumstances it is meaningless. Meaningless code should be a compile-time error. And you let the
programmer write /explicit/ code to say what he/she intends in other cases.

It is only if something is being used so often that explicit code is a
pain to read and write, that you should do anything implicitly here. So
if you are writing a language designed primarily for string processing,
you might consider having "'a' * 5" defined to be "aaaaa". Otherwise,
let the programmer write "repeat('a', 5)" or "'a'.repeat(5)", or
whatever suits the style of the language.

So now they are treated like integers, other than printing an array of
char or pointer to char assumes they are strings.

Prohibiting arithmetic on them could be dome but would make
classifying and manipulating characters difficult unless one had a
comprehensive set of library functions such as

   is_digit(char)
   is_alphanum(locale, char)
   is_lower(locale, char)
   upper(locale, char)

and many more.

As I said, you and I don't know all the possibilites. Of course there
would need to be conversions between char and int, but this can become a nuisance.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to James Harris on Sat Nov 5 18:08:37 2022

On 05/11/2022 17:07, James Harris wrote:

On 05/11/2022 13:38, Bart wrote:

On 05/11/2022 10:36, James Harris wrote:

On 04/11/2022 22:28, Bart wrote:

I actually had such a restriction for a while: char*5 wasn't
allowed, but char+1 was. After all why on earth shouldn't you want
the next character in that alphabet?

That's because 'R' + 1 may not be the next character in all
alphabets. Defining 'next' is more than difficult. It depends on
intended collation order which varies in different parts of the world
and can even change over time as authorities choose different
collation orders. Some plausible meanings of 'R' + 1:

   'S' (as in ASCII)

You said elsewhere that you want to use ASCII within programs. Which
is it happens, corresponds to the first 128 points of Unicode. Here:

     char c
     for c in 'A'..'Z' do
         print c
     od

this displays 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'. The for-loop works by
adding +1 to 'c'; it doesn't care about collating order!

That piece of code is fine for an English-speaking user but to a
Spaniard the alphabet has a character missing, and Greeks wouldn't agree
with it at all.

Where L is the locale why not allow something like

for c in L.alpha_first..L.alpha_last
    print c
od

?

That should work in English or Spanish or Greek etc, shouldn't it?

Yes. Of course, it would not work so well in Chinese (where there is no concept of alphabet), and would be a challenge for Arabic (where the
shape of letters varies enormously depending on where they come in
words). But it is definitely a step in the right direction. And makes
the code independent of the character encoding.

(This also illustrates the difference between `byte` and `char`; using
a byte type, the output would be '656667...90'.)

   'r' (what the user may want as sort order)
   's' (what the user may want as sort order)
   a non-character (as in EBCDIC)

My feeling is that it is these diverse requirements that require
user-supplied functions.

Functions, yes, (or what appear to be functions) though surely they
should be part of a library that comes with the language.

How much of a library you provide depends on the goals of the language.
You don't have to provide /everything/ !

My 'char' is still a thinly veiled numeric type, so ordinary integer
arithmatic can be used. Otherwise even something like this becomes
impossible:

     ['A'..'Z']int histogram

     midpoint := (histogram.upb - histogram.lwb)/2

     ++histogram[midpoint+1]

This requires certain properties of array indicates, like being able
to do arithmetic, as well as being consecutive ordinal values.

Why not

[L.alpha_first..L.alpha_last] int histogram

?

As for the calculations what about using L.ord and L.chr to convert
between chars and integers?

Perhaps a pseudo-call would be better such as

   char_plus(collation, 'R', 1)

where 'collation' would be used to determine what was the specified
number of characters away from 'R'.

Sure, as I said, you can provide any interpretation you like. But if
you do C+1, you expect to get the code of the next character (or next
codepoint if venturing outside ASCII).

If you use codepoints then you might not get the next character in
sequence - as in the case of 'R' in ebcdic (you'd get a non-printing character) or 'N' in Spanish (you'd get 'O' rather than the N with a hat
that a Spaniard would expect).

If the programmer wants "the next character in the alphabet" then
shouldn't the programming language or a standard library help him get
that irrespective of the human language the program is meant to be processing?

Aside from converting digits (and any other characters used as digits
in a higher number base) is there's any meaning to converting chars
to/from ints?

My static language makes byte and char slightly different types.
(Types involving char may get printed differently.)

That meant that `ref byte` and `ref char` were incompatible, which
rapidly turned into a nightmare: I might have a readfile() routine
that returned a `ref byte` type, a pointer to a block of memory.

But I wanted to interpret that block as `ref char` - a string. So this
meant loads of casts to either `ref byte` or `ref char` to get things
to work, but it got too much (a bit like 'const poisoning' in C where
it just propagates everywhere). That was clearly the wrong approach.

C's const propagation sounds like Java with its horrible, and sticky, exception propagation.

Getting "const" right is something to think long and hard about. When
do you mean "constant", when do you mean "read-only", when do you mean
"I promise this data will never change", "I will assume this data will
never change", "I promise /I/ won't change this data via this
reference", "This data will be unchanged logically but may change in
underlying representation, such as using a cache of some sort", etc. ?

Constness is a hugely powerful concept, and something you definitely
want in a language. Modern language design fashion is to making things constant be default and require explicit indication that they can
change. Some programming languages (pure functional programming
languages, for example) have /only/ constant data - there is no such
thing as variables.

In the end I relaxed the type rules so that `ref byte` and `ref char`
are compatible, and everything is now SO much simpler.

Would there have been any value in defining a layout for the untyped
area of bytes (or parts thereof)? That's where I think I am headed.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From James Harris@21:1/5 to David Brown on Sun Nov 6 09:28:27 2022

On 05/11/2022 17:01, David Brown wrote:

On 04/11/2022 23:28, Bart wrote:

..

I actually had such a restriction for a while: char*5 wasn't allowed,
but char+1 was. After all why on earth shouldn't you want the next
character in that alphabet? Why should code like this be made illegal:

     a := a * 10 + (c - '0')

Why not :

    char a;        // Use whatever syntax you prefer
    int i;        // and whatever type names you prefer

    a = digit(i);

The function "digit" might be defined :

    char digit(int i) {
        return char(i + ord('0'));
    }

Wouldn't char and ord need a locale?

That may be the wrong term but by locale I mean a bundled set of rules (including, in this case, what the digits are and how many there are of
them) which apply to the language and region the program is executing for.

Maybe it is the right term. I see on Wikipedia: "In computing, a locale
is a set of parameters that defines the user's language, region and any
special variant preferences that the user wants to see in their user interface."

https://en.wikipedia.org/wiki/Locale_(computer_software)

You want to find the next letter after "x"? "char(ord(x) + 1)". Or perhaps, like Pascal and Ada, "succ(x)".

pred and succ are great, and I was thinking to start a thread on how
they might be used for different data types. But I have to point out
that they are not enough on their own. If a user wanted the character
ten away from the current one then he wouldn't want to code ten succ operations.

--
James Harris

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From James Harris@21:1/5 to David Brown on Sun Nov 6 09:17:24 2022

On 05/11/2022 17:08, David Brown wrote:

On 05/11/2022 17:07, James Harris wrote:

On 05/11/2022 13:38, Bart wrote:

..

My feeling is that it is these diverse requirements that require
user-supplied functions.

Functions, yes, (or what appear to be functions) though surely they
should be part of a library that comes with the language.

How much of a library you provide depends on the goals of the language.
You don't have to provide /everything/ !

That's good to hear. :)

While I am trying to make it easy to invoke functions written by other
people ISTM that it's also right for the language to have associated
with it a load of standard provisions - i18n support, various data
structures, display support, maths libraries, etc for one simple reason:
code maintenance; it's easier to maintain software which uses library
calls one already knows than to have to learn yet another set of i18n
calls, for example.

..

C's const propagation sounds like Java with its horrible, and sticky,
exception propagation.

Getting "const" right is something to think long and hard about. When
do you mean "constant", when do you mean "read-only", when do you mean
"I promise this data will never change", "I will assume this data will
never change", "I promise /I/ won't change this data via this
reference", "This data will be unchanged logically but may change in underlying representation, such as using a cache of some sort", etc. ?

That sounds really interesting and I'd like to get in to it but this is
not the thread. If you wanted to start a new thread on the topic I would
reply. Suffice to say here that I don't use "const" but do have "ro" and
"rw" as usable in various contexts which effect many of the things you
mention but I don't know if I have covered everything a programmer might
need.

Constness is a hugely powerful concept, and something you definitely
want in a language. Modern language design fashion is to making things constant be default and require explicit indication that they can
change. Some programming languages (pure functional programming
languages, for example) have /only/ constant data - there is no such
thing as variables.

I haven't gone that far but, for example, I have globals as, by default,
read only and while parameters are read-write a function would have to
keep the originals around if there's a chance they would be needed.

As I say, though, such things need a thread of their own so I'll resist
the urge to say more.

--
James Harris

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to James Harris on Sun Nov 6 12:10:07 2022

On 06/11/2022 10:17, James Harris wrote:

On 05/11/2022 17:08, David Brown wrote:

On 05/11/2022 17:07, James Harris wrote:

On 05/11/2022 13:38, Bart wrote:

..

My feeling is that it is these diverse requirements that require
user-supplied functions.

Functions, yes, (or what appear to be functions) though surely they
should be part of a library that comes with the language.

How much of a library you provide depends on the goals of the
language. You don't have to provide /everything/ !

That's good to hear. :)

While I am trying to make it easy to invoke functions written by other
people ISTM that it's also right for the language to have associated
with it a load of standard provisions - i18n support, various data structures, display support, maths libraries, etc for one simple reason:
code maintenance; it's easier to maintain software which uses library
calls one already knows than to have to learn yet another set of i18n
calls, for example.

Sure. But pick one i18n library, write the FFI wrapper, and call that
your standard library. Then users don't have to deal with third-party
code and libraries, and you don't have to learn the intricacies of how
to support multiple languages properly (you don't have enough lifetimes
to learn enough to write it yourself). Everyone wins!

..

C's const propagation sounds like Java with its horrible, and sticky,
exception propagation.

Getting "const" right is something to think long and hard about. When
do you mean "constant", when do you mean "read-only", when do you mean
"I promise this data will never change", "I will assume this data will
never change", "I promise /I/ won't change this data via this
reference", "This data will be unchanged logically but may change in
underlying representation, such as using a cache of some sort", etc. ?

That sounds really interesting and I'd like to get in to it but this is
not the thread. If you wanted to start a new thread on the topic I would reply. Suffice to say here that I don't use "const" but do have "ro" and
"rw" as usable in various contexts which effect many of the things you mention but I don't know if I have covered everything a programmer might need.

Constness is a hugely powerful concept, and something you definitely
want in a language. Modern language design fashion is to making
things constant be default and require explicit indication that they
can change. Some programming languages (pure functional programming
languages, for example) have /only/ constant data - there is no such
thing as variables.

I haven't gone that far but, for example, I have globals as, by default,
read only and while parameters are read-write a function would have to
keep the originals around if there's a chance they would be needed.

As I say, though, such things need a thread of their own so I'll resist
the urge to say more.

Fair enough. It's a big topic, and deserves its own thread. All I will
do here is encourage you to think hard about it, learn about it, and
test ideas early on - if you try to add "const" to a language later, it
will inevitably be a mess, complicated and inconsistent.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to James Harris on Sun Nov 6 12:23:32 2022

On 06/11/2022 10:28, James Harris wrote:

On 05/11/2022 17:01, David Brown wrote:

On 04/11/2022 23:28, Bart wrote:

..

I actually had such a restriction for a while: char*5 wasn't allowed,
but char+1 was. After all why on earth shouldn't you want the next
character in that alphabet? Why should code like this be made illegal:

     a := a * 10 + (c - '0')

Why not :

     char a;        // Use whatever syntax you prefer
     int i;        // and whatever type names you prefer

     a = digit(i);

The function "digit" might be defined :

     char digit(int i) {
         return char(i + ord('0'));
     }

Wouldn't char and ord need a locale?

That would depend on what you are trying to do. If you wanted a real multi-lingual "digit" function, then yes - and you'd return different
Unicode characters for different languages. But it is also important to
have some way of getting to the underlying representation of the
characters. At a minimum, you'll need that to implement the library of functions for dealing with locales. (Maybe you want to distinguish
between "low-level" or "library implementation" code that is allowed to
do such things, and "user" code that is not - just as some languages
have "safe" and "unsafe" code modes.)

That may be the wrong term but by locale I mean a bundled set of rules (including, in this case, what the digits are and how many there are of
them) which apply to the language and region the program is executing for.

Maybe it is the right term. I see on Wikipedia: "In computing, a locale
is a set of parameters that defines the user's language, region and any special variant preferences that the user wants to see in their user interface."

https://en.wikipedia.org/wiki/Locale_(computer_software)

You want to find the next letter after "x"? "char(ord(x) + 1)". Or
perhaps, like Pascal and Ada, "succ(x)".

pred and succ are great, and I was thinking to start a thread on how
they might be used for different data types. But I have to point out
that they are not enough on their own. If a user wanted the character
ten away from the current one then he wouldn't want to code ten succ operations.

You could give "pred" and "succ" an optional step argument.

You will also have to think about what happens if the result doesn't
make sense - if you step beyond the range for the type, do you throw an
error of some sort? Should some kinds of types have wrapping succ/pred operations?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From James Harris@21:1/5 to David Brown on Sun Nov 13 20:49:22 2022

On 06/11/2022 11:10, David Brown wrote:

On 06/11/2022 10:17, James Harris wrote:

On 05/11/2022 17:08, David Brown wrote:

...

How much of a library you provide depends on the goals of the
language. You don't have to provide /everything/ !

That's good to hear. :)

While I am trying to make it easy to invoke functions written by other
people ISTM that it's also right for the language to have associated
with it a load of standard provisions - i18n support, various data
structures, display support, maths libraries, etc for one simple
reason: code maintenance; it's easier to maintain software which uses
library calls one already knows than to have to learn yet another set
of i18n calls, for example.

Sure. But pick one i18n library, write the FFI wrapper, and call that
your standard library. Then users don't have to deal with third-party
code and libraries, and you don't have to learn the intricacies of how
to support multiple languages properly (you don't have enough lifetimes
to learn enough to write it yourself). Everyone wins!

Yes, internationalisation is an enormous area. I don't have anything
like the requisite knowledge of the world's languages and character sets
to do the work. What I /can/ do, AISI, is to establish principles and restrictions intended to make multilingual programming more natural. For example,

* To define the standard form of strings to have unadorned characters
stored as separate codes from diacritics (which I gather may be called combining characters).

* Diacritic codes would all have to be stored in a certain order
relative to each other.

* String encodings would put diacritics before the character to which
they apply. (Though am not sure what to do about accents which apply to
whole words or groups of characters.)

* The language would not support ordering of characters based on their
internal codes. All ordering would require a locale to indicate which
should come first.

* String ordering would require creation of a 'comparison string'
created according to the rules of a selected locale. The internal codes
of the comparison string /would/ be comparable for orde5ring.

* There would be a standard API which all i18n libraries would have to
support.

etc

Whether that kind of approach is valid or not, I don't know. It's just
my best guess at what may be required.

BTW, emoticons are a complete nightmare. There could be any number of
them, they may render differently on different devices and no ordering
of them makes sense.

--
James Harris

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

Who's Online
Recent Visitors
- Plume
  Sun Sep 14 09:34:52 2025
  from Uk via Raw
- Gretchiie
  Sun Sep 14 06:07:30 2025
  from Derry, Nh via Telnet
- Thlc
  Sat Sep 13 17:11:34 2025
  from Rognac, France via Telnet
- Thlc
  Sat Sep 13 17:04:03 2025
  from Rognac, France via Telnet
- Thlc
  Sat Sep 13 16:32:19 2025
  from Rognac, France via SSH
- Thlc
  Sat Sep 13 15:41:11 2025
  from Rognac, France via SSH
- Thlc
  Sat Sep 13 07:56:03 2025
  from Rognac, France via SSH
- Gretchiie
  Sat Sep 13 07:22:10 2025
  from Derry, Nh via Telnet

System Info

Sysop:	Keyop
Location:	Huddersfield, West Yorkshire, UK
Users:	546
Nodes:	16 (0 / 16)
Uptime:	169:21:07
Calls:	10,385
Calls today:	2
Files:	14,057
Messages:	6,416,551

Storing strings

Who's Online

Recent Visitors

System Info