• Storing strings

    From James Harris@21:1/5 to All on Mon Oct 24 11:31:11 2022
    Do you guys have any thoughts on the best ways for strings of characters
    to be stored?

    1. There's the C way, of course, of reserving one value (zero) and using
    it as a terminator.

    2. There's the 'length prefix' option of putting the length of the
    string in a machine word before the characters.

    3. There's the 'double pointer' way of pointing at, say, first and past
    (where 'past' means first plus length such that the second pointer
    points one position beyond the last character).

    Any others?

    Options 1 and 2 have the advantage that they can be referred to simply
    by address. Option 3 needs an additional place in which to store the
    (first, past) control block.

    Option 1 has the advantage that it's easy for a program to process (by
    either pointer or index).

    Options 1 and 3 have the advantage that one can refer to the tail of the
    string (anything past the first character) without creating a copy,
    although option 3 would need a new control block to be created. Option 2
    would require a new string to be created.

    In fact, option 3 has the advantage that it allows any continuous
    substring - head, mid, or tail - to be referred to without making a copy
    of the required part of the string.

    Options 2 and 3 make it fast to find the length. They also allow any
    value (i.e. including zero) to be part of the string.
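
    As a rough illustration only, the three layouts might be sketched in C
    along the following lines. The names zstr, lpstr and fpstr and their
    helper functions are invented for this sketch and do not come from any
    library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Option 1: zero-terminated. The reference is just an address. */
typedef const char *zstr;

/* Option 2: length held in a machine word before the characters.
   The reference is again a single address (that of the header). */
typedef struct {
    size_t len;
    char   data[];          /* C99 flexible array member */
} lpstr;

/* Option 3: a (first, past) control block kept apart from the characters. */
typedef struct {
    const char *first;
    const char *past;       /* one position beyond the last character */
} fpstr;

/* Allocate an option-2 string; any byte value, including zero, is allowed. */
static inline lpstr *lpstr_new(const char *src, size_t n) {
    lpstr *s = malloc(sizeof *s + n);
    if (s == NULL) return NULL;
    s->len = n;
    if (n > 0) memcpy(s->data, src, n);
    return s;
}

/* Length: option 1 must scan, options 2 and 3 read it off directly. */
static inline size_t zstr_len(zstr s)          { return strlen(s); }
static inline size_t lpstr_len(const lpstr *s) { return s->len; }
static inline size_t fpstr_len(fpstr s)        { return (size_t)(s.past - s.first); }

/* Tail without copying: option 1 needs only a new address. */
static inline zstr zstr_tail(zstr s) {
    assert(*s != '\0');
    return s + 1;
}

/* Any contiguous substring without copying: option 3 needs a new
   control block but leaves the characters where they are. */
static inline fpstr fpstr_sub(fpstr s, size_t from, size_t to) {
    assert(from <= to && to <= fpstr_len(s));
    return (fpstr){ s.first + from, s.first + to };
}
```

    Note that there is no lpstr_tail here: with option 2 the length word
    has to sit immediately before the characters, so a tail would need a
    new string.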

    So: Which of those should a compiler support? Should it support more
    than one form? If so, should the language allow the programmer to
    specify which form to use on any particular string?

    If that's not complicated enough, the above essentially considers
    strings whose contents could be read-only or read-write but their
    lengths don't change. If the lengths can change then there are
    additional issues of storage management. Eek! ;)

    Recommendations welcome!


    --
    James Harris

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Ram@21:1/5 to James Harris on Mon Oct 24 11:44:38 2022
    James Harris <james.harris.1@gmail.com> writes:
    So: Which of those should a compiler support? Should it support more
    than one form? If so, should the language allow the programmer to
    specify which form to use on any particular string?

    I think the idea of C is to leave it up to the programmer.
    The C string literals and functions are just some kind of
    suggestion, and they help to provide basic services, such
    as printing some text to the terminal. But otherwise, the
    programmer is free to implement his own string type(s) or
    use string libraries.

    The choice depends on the expected type of use. For example,
    some ways to store strings are known as "ropes" (Hans J Boehm,
    1994), others are known as "gap buffers". A text editor
    might simultaneously use ropes for its text buffers and
    C strings for filenames.

    The crucial thing for allowing programmers to implement
    their own string type is that the language is fast enough
    to do this with little overhead compared to an implementation
    of strings in the language itself. Implementing custom
    string representations in slow languages might not be feasible.

  • From Dmitry A. Kazakov@21:1/5 to James Harris on Mon Oct 24 14:07:32 2022
    On 2022-10-24 12:31, James Harris wrote:
    Do you guys have any thoughts on the best ways for strings of characters
    to be stored?

    1. There's the C way, of course, of reserving one value (zero) and using
    it as a terminator.

    2. There's the 'length prefix' option of putting the length of the
    string in a machine word before the characters.

    3. There's the 'double pointer' way of pointing at, say, first and past
    (where 'past' means first plus length such that the second pointer
    points one position beyond the last character).

    Any others?

    4. String body only. The constraints are known outside.

    This is the way string slices and fixed-length strings are implemented.
    In the latter case the compiler knows the string's bounds (first and last
    indices and thus the length). In the former case the compiler passes a
    "string dope" along with the naked body. The dope contains the bounds.

    This has an effect on pointers. E.g. if you want slices and efficient
    raw strings you must distinguish pointers to definite (constrained)
    vs. indefinite (unconstrained) objects of the same type.

    E.g. in Ada you cannot take an indefinite string pointer to a
    fixed-length string because there are no bounds. If you wanted that
    feature you would use a "fat pointer" to carry the bounds with it.

    This is similar to atomic and volatile objects and pointers to them.
    The mechanics are the same. You cannot take a general-purpose pointer
    to an atomic object, because the client code would not know that it
    should take care upon dereferencing.

    --
    Regards,
    Dmitry A. Kazakov
    http://www.dmitry-kazakov.de

  • From Bart@21:1/5 to James Harris on Mon Oct 24 15:28:44 2022
    On 24/10/2022 11:31, James Harris wrote:
    Do you guys have any thoughts on the best ways for strings of characters
    to be stored?

    1. There's the C way, of course, of reserving one value (zero) and using
    it as a terminator.

    2. There's the 'length prefix' option of putting the length of the
    string in a machine word before the characters.

    3. There's the 'double pointer' way of pointing at, say, first and past
    (where 'past' means first plus length such that the second pointer
    points one position beyond the last character).

    Any others?

    Options 1 and 2 have the advantage that they can be referred to simply
    by address. Option 3 needs an additional place in which to store the
    (first, past) control block.

    Option 1 has the advantage that it's easy for a program to process (by
    either pointer or index).

    Options 1 and 3 have the advantage that one can refer to the tail of the
    string (anything past the first character) without creating a copy,
    although option 3 would need a new control block to be created. Option 2
    would require a new string to be created.

    In fact, option 3 has the advantage that it allows any continuous
    substring - head, mid, or tail - to be referred to without making a copy
    of the required part of the string.

    Options 2 and 3 make it fast to find the length. They also allow any
    value (i.e. including zero) to be part of the string.

    So: Which of those should a compiler support? Should it support more
    than one form? If so, should the language allow the programmer to
    specify which form to use on any particular string?

    If that's not complicated enough, the above essentially considers
    strings whose contents could be read-only or read-write but their
    lengths don't change. If the lengths can change then there are
    additional issues of storage management. Eek! ;)

    For lower level strings, I'd highly recommend using zero-terminated
    strings, or using them as the basis, or at least having it as an option.

    This is not the 'C way', as I'd long used this outside of C and Unix
    (e.g. in DEC assembly, and in my own stuff for at least a decade before
    I first dealt with C).

    I still use them. Among many advantages such as pure simplicity, they
    allow you to directly make use of innumerable APIs that specify such
    strings.

    They can be used in contexts such as the compact string fields of
    structs, since the only overhead is allowing space for that terminator **.


    The next step up, in lower level code, is to use a slice. This is a
    (pointer, length) descriptor. Here no terminator is necessary, which
    also allows strings to contain embedded zeros (so they can contain any
    binary data).

    String slices can point into another string (allowing sharing), or into
    another slice, or into a regular zero-terminated string.

    However, to call an API function expecting a zero-terminated string
    ('stringz' as I sometimes call it), the pointer is not enough: you need
    to ensure there's a zero following those <length> characters!
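
    A minimal sketch in C of one way to bridge that gap, assuming the
    caller is willing to pay for a copy. The names slice and
    slice_to_stringz are hypothetical, and rejecting embedded zeros is a
    policy chosen for this sketch rather than anything an API requires.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* A (pointer, length) slice; the data need not be zero-terminated. */
typedef struct {
    const char *ptr;
    size_t      len;
} slice;

/* Return a freshly allocated zero-terminated copy of the slice.
   Returns NULL if allocation fails or if the slice holds an embedded
   zero (which a stringz consumer would treat as the end of the text).
   The caller frees the result. */
static inline char *slice_to_stringz(slice s) {
    if (s.len > 0 && memchr(s.ptr, '\0', s.len) != NULL)
        return NULL;
    char *z = malloc(s.len + 1);
    if (z == NULL)
        return NULL;
    if (s.len > 0)
        memcpy(z, s.ptr, s.len);
    z[s.len] = '\0';            /* the zero following those len characters */
    return z;
}
```

    The copy is what makes this safe for a slice that points into the
    middle of another string, where writing a terminator in place would
    corrupt the parent.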


    Within my dynamic scripting language, I have a full-on counted string
    type, with reference counting to manage sharing and allow automatic
    memory management. But it has the same headache when calling low-level
    FFI functions that expect C-like strings.

    But that language at least will cope with it.

    (** The scripting language can define structs with fixed types,
    including fixed-width string fields. Those are defined in two ways:

        stringz*8 A
        stringc*8 B

    Both A and B occupy an 8-byte field. But A can store a string of at most
    7 characters, while B can store up to 8.

    Yet B also includes the count so no scanning is needed to determine the
    string length. The scheme however only works on fields of 2 to 256
    characters.)

  • From James Harris@21:1/5 to Dmitry A. Kazakov on Wed Oct 26 20:43:11 2022
    On 24/10/2022 13:07, Dmitry A. Kazakov wrote:
    On 2022-10-24 12:31, James Harris wrote:
    Do you guys have any thoughts on the best ways for strings of
    characters to be stored?

    1. There's the C way, of course, of reserving one value (zero) and
    using it as a terminator.

    2. There's the 'length prefix' option of putting the length of the
    string in a machine word before the characters.

    3. There's the 'double pointer' way of pointing at, say, first and
    past (where 'past' means first plus length such that the second
    pointer points one position beyond the last character).

    Any others?

    4. String body only. The constraints are known outside.

    This is the way string slices and fixed-length strings are implemented.
    In the latter case the compiler knows the string's bounds (first and last
    indices and thus the length). In the former case the compiler passes a
    "string dope" along with the naked body. The dope contains the bounds.

    That doesn't seem meaningfully different from case 3. To be clear, case
    3 would be represented by, in addition to the bytes of the string,

    struct
      first: pointer to first byte of string
      past:  pointer to byte after last byte of string
      .... other fields ....
    end struct

    The string length would be past - first. The bytes of the string would
    be those pointed at (which I presume is what you are calling the naked
    body).


    This has an effect on pointers. E.g. if you want slices and efficient
    raw strings you must distinguish pointers to definite (constrained)
    vs. indefinite (unconstrained) objects of the same type.

    E.g. in Ada you cannot take an indefinite string pointer to a
    fixed-length string because there are no bounds. If you wanted that
    feature you would use a "fat pointer" to carry the bounds with it.

    Any reason you'd recommend against storing bounds as in the struct, above?


    This is similar to atomic and volatile objects and pointers to them.
    The mechanics are the same. You cannot take a general-purpose pointer
    to an atomic object, because the client code would not know that it
    should take care upon dereferencing.


    I am not sure what that means. I guess the point you are making is that
    there are levels of classification which don't affect the data type but
    they do affect how it can be accessed - with the language needing to
    prevent a reference weakening the storage model. For example, a
    read-write reference to a substring should be prevented from being used
    to access part of a string which is supposed to be read-only.



    --
    James Harris

  • From James Harris@21:1/5 to Stefan Ram on Wed Oct 26 21:52:06 2022
    On 24/10/2022 12:44, Stefan Ram wrote:
    James Harris <james.harris.1@gmail.com> writes:
    So: Which of those should a compiler support? Should it support more
    than one form? If so, should the language allow the programmer to
    specify which form to use on any particular string?

    I think the idea of C is to leave it up to the programmer.
    The C string literals and functions are just some kind of
    suggestion, and they help to provide basic services, such
    as printing some text to the terminal. But otherwise, the
    programmer is free to implement his own string type(s) or
    use string libraries.

    The choice depends on the expected type of use. For example,
    some ways to store strings are known as "ropes" (Hans J Boehm,
    1994), others are known as "gap buffers". A text editor
    might simultaneously use ropes for its text buffers and
    C strings for filenames.

    Thanks for the references.


    The crucial thing for allowing programmers to implement
    their own string type is that the language is fast enough
    to do this with little overhead compared to an implementation
    of strings in the language itself. Implementing custom
    string representations in slow languages might not be feasible.

    That's fair. I would like, however, to have an inbuilt string type that
    is easy to work with so that there's a pre-made standard and programmers
    don't have to come up with their own or to spend time working out what a
    previous programmer had created.

    I should have suggested a string or slice interface. Here's a first
    attempt at the operations a string would be expected to be hit with.

    These are part of the mechanics of string handling, relating to the
    structure of the string rather than to its contents, so I've not
    included anything in this list which looks at the content of the string.
    There are loads of content-based operations such as string comparisons,
    case conversion, whitespace trimming, etc, which could be built on top
    of the basic handling.

    Potential operations on string structures:
    * allocate a new string
    * create a slice (view) of an existing string
    * index into a string
    * increase the size of a string
    * reduce the size of a string
    * return the length of the string
    * append/delete characters from the end
    * insert/delete characters at the beginning
    * take a slice of a string
    * concatenate strings (including copying)
    * pass to and from functions

    The idea of slices is that they would appear to be strings but could be
    created to refer to the same string elements without allocating new
    storage for the sliced data.
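
    To make a few of the structural operations listed above concrete, here
    is a small sketch in C of a string that can grow: allocate, append to
    the end, delete from the end, index, and return the length. The gstr
    type, its functions and the doubling growth policy are all assumptions
    made for this sketch; overflow checks are left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char  *data;    /* owned storage, not zero-terminated */
    size_t len;     /* characters in use */
    size_t cap;     /* characters allocated */
} gstr;

/* Allocate a new, empty string (no storage until the first append). */
static inline gstr gstr_new(void) {
    return (gstr){ NULL, 0, 0 };
}

/* Return the length of the string. */
static inline size_t gstr_len(const gstr *s) {
    return s->len;
}

/* Append n characters to the end, growing the storage if needed.
   Returns 0 on success, -1 if allocation fails (string unchanged). */
static inline int gstr_append(gstr *s, const char *src, size_t n) {
    if (n > s->cap - s->len) {
        size_t need = s->len + n;
        size_t cap = s->cap ? s->cap : 16;
        while (cap < need)
            cap *= 2;
        char *p = realloc(s->data, cap);
        if (p == NULL)
            return -1;
        s->data = p;
        s->cap = cap;
    }
    if (n > 0)
        memcpy(s->data + s->len, src, n);
    s->len += n;
    return 0;
}

/* Delete n characters from the end; storage is kept for reuse. */
static inline void gstr_truncate(gstr *s, size_t n) {
    assert(n <= s->len);
    s->len -= n;
}

/* Index into the string, with a bounds check. */
static inline char gstr_at(const gstr *s, size_t i) {
    assert(i < s->len);
    return s->data[i];
}

/* Release the storage and reset to the empty state. */
static inline void gstr_free(gstr *s) {
    free(s->data);
    *s = gstr_new();
}
```

    Because an append may move the characters, any slice that points into
    a gstr would be left dangling after growth, which is one of the
    storage-management issues this kind of design has to settle.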

    These are just some ideas on what might be required. To do this
    comprehensively seems rather complicated! :(


    --
    James Harris

  • From James Harris@21:1/5 to Bart on Wed Oct 26 21:33:59 2022
    On 24/10/2022 15:28, Bart wrote:
    On 24/10/2022 11:31, James Harris wrote:

    Do you guys have any thoughts on the best ways for strings of
    characters to be stored?

    ..

    For lower level strings, I'd highly recommend using zero-terminated
    strings, or using them as the basis, or at least having it as an option.

    They certainly seem easiest to work with although they do have
    limitations such as:

    * cannot include a character with the encoding of zero (as you say)
    * must be scanned to determine length
    * awkward to add to or delete from the end of as they don't carry any
    data about whether the memory immediately following is available or not


    This is not the 'C way', as I'd long used this outside of C and Unix
    (e.g. in DEC assembly, and in my own stuff for at least a decade before
    I first dealt with C).

    True, though C popularised the scheme. Besides, on the PDP one way of
    storing strings was apparently as

    (length, address)

    That's according to the Commercial Instruction Set (CIS) part of

    https://en.wikipedia.org/wiki/PDP-11_architecture


    I still use them. Among many advantages such as pure simplicity, they
    allow you to directly make use of innumerable APIs that specify such
    strings.

    They can be used in contexts such as the compact string fields of
    structs, since the only overhead is allowing space for that terminator **.


    OK.


    The next step up, in lower level code, is to use a slice. This is a
    (pointer, length) descriptor. Here no terminator is necessary, which
    also allows strings to contain embedded zeros (so they can contain any
    binary data).

    String slices can point into another string (allowing sharing), or into
    another slice, or into a regular zero-terminated string.

    That's more universal and therefore perhaps the best to implement if
    only one scheme is to be available. Have to say, though, I guess it
    would be hard to manage the memory for. Instead of just (first, length)
    or (first, past) perhaps one would need something like

    struct
      first: pointer to first element
      past:  pointer just past last element
      count: number of slices pointing to this slice/string
      base:  the parent string or memory
      flags: various
    end struct

    The base field would refer to the string object we were a slice of or,
    if we were not a slice but the base string, the memory area in which the
    string was stored.

    The flags would indicate whether the string/slice could have its
    contents changed and whether it could have its length changed, whether
    the contents could be moved in memory, etc.
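
    A rough C sketch of how such a control block might behave, to test the
    idea rather than to settle it. The names strobj, str_create, str_slice
    and str_release are invented here. As a simplifying assumption, base
    is NULL for a string that owns its memory, instead of pointing at the
    memory area, and count includes the creator's own reference.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

enum { STR_READONLY = 1, STR_FIXED_LEN = 2 };   /* example flag bits */

typedef struct strobj {
    char          *first;   /* pointer to first element */
    char          *past;    /* pointer just past last element */
    unsigned       count;   /* references to this object, slices included */
    struct strobj *base;    /* parent string, or NULL if we own the memory */
    unsigned       flags;
} strobj;

/* Create a base string holding a copy of n characters from src. */
static inline strobj *str_create(const char *src, size_t n) {
    strobj *s = malloc(sizeof *s);
    if (s == NULL) return NULL;
    s->first = malloc(n ? n : 1);
    if (s->first == NULL) { free(s); return NULL; }
    if (n > 0) memcpy(s->first, src, n);
    s->past  = s->first + n;
    s->count = 1;
    s->base  = NULL;
    s->flags = 0;
    return s;
}

/* Create a slice [from, to) of parent without copying characters. */
static inline strobj *str_slice(strobj *parent, size_t from, size_t to) {
    assert(from <= to && to <= (size_t)(parent->past - parent->first));
    strobj *s = malloc(sizeof *s);
    if (s == NULL) return NULL;
    s->first = parent->first + from;
    s->past  = parent->first + to;
    s->count = 1;
    s->base  = parent;
    s->flags = parent->flags | STR_FIXED_LEN;   /* a view cannot grow */
    parent->count++;                            /* slice keeps parent alive */
    return s;
}

/* Drop one reference; free objects whose count reaches zero, walking
   up the chain of parents. Only a base string frees the characters. */
static inline void str_release(strobj *s) {
    while (s != NULL && --s->count == 0) {
        strobj *parent = s->base;
        if (parent == NULL)
            free(s->first);
        free(s);
        s = parent;
    }
}
```

    With this arrangement the characters stay valid for as long as any
    slice of them exists, even after the original holder lets go.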


    However, to call an API function expecting a zero-terminated string
    ('stringz' as I sometimes call it), the pointer is not enough: you need
    to ensure there's a zero following those <length> characters!


    Within my dynamic scripting language, I have a full-on counted string
    type, with reference counting to manage sharing and allow automatic
    memory management.

    What fields did you use to manage such stuff? Am I on the right lines
    with the ideas above?


    But it has the same headache when calling low-level FFI
    functions that expect C-like strings.

    Just a thought: ensure there is always at least one more byte of memory
    than the string requires and put a zero byte at the end of the string
    before calling any function which expects a C-like string. (User
    responsibility to ensure there are no zero bytes embedded in the string.)

    One reason for the headache may be that some predefined data structures
    include a fixed-length field in which a string can sit but which has no
    room for another byte such as a terminating zero. But for those the
    string could be copied out.

    Having a string defined by a (first, past) pair would perhaps allow
    fixed-length fields to be handled as easily as mutable strings.


    --
    James Harris

  • From Dmitry A. Kazakov@21:1/5 to James Harris on Thu Oct 27 09:28:51 2022
    On 2022-10-26 21:43, James Harris wrote:
    On 24/10/2022 13:07, Dmitry A. Kazakov wrote:
    On 2022-10-24 12:31, James Harris wrote:
    Do you guys have any thoughts on the best ways for strings of
    characters to be stored?

    1. There's the C way, of course, of reserving one value (zero) and
    using it as a terminator.

    2. There's the 'length prefix' option of putting the length of the
    string in a machine word before the characters.

    3. There's the 'double pointer' way of pointing at, say, first and
    past (where 'past' means first plus length such that the second
    pointer points one position beyond the last character).

    Any others?

    4. String body only. The constraints are known outside.

    This is the way string slices and fixed-length strings are
    implemented. In the latter case the compiler knows the string's bounds
    (first and last indices and thus the length). In the former case the
    compiler passes a "string dope" along with the naked body. The dope
    contains the bounds.

    That doesn't seem meaningfully different from case 3. To be clear, case
    3 would be represented by, in addition to the bytes of the string,

    struct
      first: pointer to first byte of string
      past:  pointer to byte after last byte of string
      .... other fields ....
    end struct

    The string length would be past - first. The bytes of the string would
    be those pointed at (which I presume is what you are calling the naked
    body).

    That is the structure of a string dope, not the string itself, unless
    you have the body in other fields, but then why would you need pointers?

    To clarify terms. String representation must include the string body if
    we are talking about values of strings. The things like pointers and
    vectorized dopes are references to a string, not strings. You can pass a
    string by a reference, sure. But the string value is somewhere else.
    What you pass is not a string; it is a substitute.

    This has an effect on pointers. E.g. if you want slices and efficient
    raw strings you must distinguish pointers to definite (constrained)
    vs. indefinite (unconstrained) objects of the same type.

    E.g. in Ada you cannot take an indefinite string pointer to a
    fixed-length string because there are no bounds. If you wanted that
    feature you would use a "fat pointer" to carry the bounds with it.

    Any reason you'd recommend against storing bounds as in the struct, above?

    Start with the interoperability of strings and slices of them. The
    crucial requirements would be:

    A slice can be passed to a subprogram expecting a string without
    copying.

    Consider efficiency and low-level close to hardware stuff:

    Aggregation of strings with known bounds does not require storing them.

    E.g. you can have arrays of fixed length strings (like an image buffer).
    If a member of a structure is a fixed length string, no bounds are
    stored. A pointer to a fixed length string is a plain pointer etc.
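
    In C terms the distinction might look roughly like the sketch below.
    It is only an analogy: C has no notion of constrained subtypes, and
    the names record, fatptr, record_name and fat_count are made up for
    the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { NAME_LEN = 8 };

/* A fixed-length string member: the bounds are part of the type, so
   only the body is stored (no length word, no terminator). */
typedef struct {
    char name[NAME_LEN];
    int  id;
} record;

/* A "fat pointer" for strings whose bounds are not known statically:
   the address travels together with the length. */
typedef struct {
    const char *ptr;
    size_t      len;
} fatptr;

/* Passing a fixed-length string where an unconstrained one is expected:
   the bounds are supplied from the type at the point of conversion,
   and no characters are copied. */
static inline fatptr record_name(const record *r) {
    assert(r != NULL);
    return (fatptr){ r->name, NAME_LEN };
}

/* A subprogram that accepts a string of any length. */
static inline size_t fat_count(fatptr s, char c) {
    size_t n = 0;
    for (size_t i = 0; i < s.len; i++)
        if (s.ptr[i] == c)
            n++;
    return n;
}
```

    An array such as char table[4][NAME_LEN] then occupies exactly
    4 * NAME_LEN bytes, since none of its elements carries its own bounds.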

    This is similar to atomic and volatile objects and pointers to them.
    The mechanics are the same. You cannot take a general-purpose pointer
    to an atomic object, because the client code would not know that it
    should take care upon dereferencing.

    I am not sure what that means. I guess the point you are making is that
    there are levels of classification which don't affect the data type but
    they do affect how it can be accessed - with the language needing to
    prevent a reference weakening the storage model. For example, a
    read-write reference to a substring should be prevented from being used
    to access part of a string which is supposed to be read-only.

    Yes, it is a type constraint. There are all sorts of constraints one
    could put on a type in order to produce a constrained subtype.
    Constraining limits operations, e.g. immutability removes mutators. It
    also directs certain implementations like using locking instructions or
    dropping known bounds.

    --
    Regards,
    Dmitry A. Kazakov
    http://www.dmitry-kazakov.de

  • From Stefan Ram@21:1/5 to James Harris on Thu Oct 27 08:58:50 2022
    James Harris <james.harris.1@gmail.com> writes:
    Potential operations on string structures:
    * allocate a new string
    * create a slice (view) of an existing string
    * index into a string

    Many such operations are provided by the standard library
    of C++. You could have a look at its implementation. One might
    even think of kinda "backporting" it to C. Or use C++.

    Suggested Video: "The strange details of std::string at
    Facebook" - Nicholas Ormrod (2016)

  • From Bart@21:1/5 to James Harris on Thu Oct 27 12:14:28 2022
    On 26/10/2022 21:33, James Harris wrote:
    On 24/10/2022 15:28, Bart wrote:
    On 24/10/2022 11:31, James Harris wrote:

    Do you guys have any thoughts on the best ways for strings of
    characters to be stored?

    ..

    For lower level strings, I'd highly recommend using zero-terminated
    strings, or using them as the basis, or at least having it as an option.

    They certainly seem easiest to work with although they do have
    limitations such as:

    * cannot include a character with the encoding of zero (as you say)
    * must be scanned to determine length

    I use such strings in my 1Mlps compilers. Note that in that application:

    * You don't need to get string lengths that often
    * The vast majority of strings are short
    * String append (that you mention below) is mainly when speed is not
    critical (eg. for diagnostics)

    * awkward to add to or delete from the end of as they don't carry any
    data about whether the memory immediately following is available or not

    The next step up, in lower level code, is to use a slice. This is a
    (pointer, length) descriptor. Here no terminator is necessary, which
    also allows strings to contain embedded zeros (so they can contain any
    binary data).

    String slices can point into another string (allowing sharing), or
    into another slice, or into a regular zero-terminated string.

    That's more universal and therefore perhaps the best to implement if
    only one scheme is to be available.

    Most strings are fixed-length once created; strings that can grow are
    rare. You don't need a 'capacity' field for example (like C++'s
    std::vector type).

    But managing memory can still be an issue because you don't know if a
    particular slice owns its memory, or points to a string literal, or
    points into a shared string, or points to external memory.

    So a simple slice suits a lower-level language where you do this
    manually (it would be a welcome addition to C for example).

    My main language is just like this.

    Have to say, though, I guess it
    would be hard to manage the memory for. Instead of just (first, length)
    or (first, past) perhaps one would need something like

      struct
        first: pointer to first element
        past:  pointer just past last element
        count: number of slices pointing to this slice/string
        base:  the parent string or memory
        flags: various
      end struct

    The base field would refer to the string object we were a slice of or,
    if we were not a slice but the base string, the memory area in which the string was stored.

    The flags would indicate whether the string/slice could have its
    contents changed and whether it could have its length changed, whether
    the contents could be moved in memory, etc.


    However, to call an API function expecting a zero-terminated string
    ('stringz' as I sometimes call it), the pointer is not enough: you
    need to ensure there's a zero following those <length> characters!



    Within my dynamic scripting language, I have a full-on counted string
    type, with reference counting to manage sharing and allow automatic
    memory management.

    What fields did you use to manage such stuff? Am I on the right lines
    with the ideas above?

    The structure I use is not lightweight because it is for interpreted
    code. The following object descriptor is a 32-byte record, used for all
    objects. I've shown only the fields used by string objects:

    record objrec =
        u32 refcount
        byte mutable        # 1 for mutable strings
        byte objtype
        u16 dummy

        ichar strptr        # (ref char)
        u64 length
        union
            u64 alloc64
            object objptr2  # (ref objptr)
        end
    end

    The string data itself is separate, pointed to by 'strptr'. This is nil
    when the length is zero (it doesn't point to ""). It is not
    zero-terminated (unless an external slice happens to be).

    Most strings are mutable, in which case .alloc64 gives the capacity of
    the allocation.

    An important field is objtype; its values are:

    Normal     Regular string (uses alloc64)
    Slice      Slice into another (uses objptr2)
    Extslice   Strings lie outside the object scheme

    For slices, while .strptr refers to the string data in question,
    .objptr2 refers to the owner object of that string, which has its own
    refcount.

    External strings are those that belong to external code (e.g. from an
    FFI function), or those occurring inside a packed struct field for
    example.

    So .objtype is used when sharing or freeing string data.

    As I said, this is for interpreted code which can afford to do this
    fiddly checking at runtime, which is not done inline either.

    For static languages using inline code, it might need to be more
    streamlined.

    Note that if you take those 32 bytes, then the middle 16 bytes (.strptr
    and .length fields) correspond to a raw Slice as used in my lower level
    language.
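
    For readers more at home in C, the record above might be transcribed
    roughly as follows. This is an approximation, not the interpreter's
    actual source: the field is renamed is_mutable, the union is given the
    name u, the 32-byte size assumes a typical 64-bit target, and
    obj_owner is a guess at how .objtype could be consulted when sharing
    or freeing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { OBJ_NORMAL, OBJ_SLICE, OBJ_EXTSLICE };

typedef struct objrec {
    uint32_t refcount;
    uint8_t  is_mutable;         /* 1 for mutable strings */
    uint8_t  objtype;            /* OBJ_NORMAL, OBJ_SLICE or OBJ_EXTSLICE */
    uint16_t dummy;
    char    *strptr;             /* string data; NULL when length is 0 */
    uint64_t length;
    union {
        uint64_t       alloc64;  /* OBJ_NORMAL: capacity of the allocation */
        struct objrec *objptr2;  /* OBJ_SLICE: object that owns the data */
    } u;
} objrec;

/* The raw (pointer, length) slice matching the middle 16 bytes. */
typedef struct {
    char    *ptr;
    uint64_t length;
} rawslice;

static inline rawslice obj_as_slice(const objrec *o) {
    return (rawslice){ o->strptr, o->length };
}

/* Which object's refcount governs the lifetime of the string data?
   Returns NULL for external data, which this scheme must never free. */
static inline objrec *obj_owner(objrec *o) {
    switch (o->objtype) {
    case OBJ_NORMAL: return o;
    case OBJ_SLICE:  return o->u.objptr2;
    default:         return NULL;
    }
}
```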


    But with the same headache when calling low-level FFI functions that
    expect C-like strings.

    Just a thought: ensure there is always at least one more byte of memory
    than the string requires and put a zero byte at the end of the string
    before calling any function which expects a C-like string. (User responsibility to ensure there are no zero bytes embedded in the string.)

    I think I tried that once. In general it doesn't work, as you might have
    a slice into another string; you can't inject a zero byte into the
    middle of that other string!

  • From Stefan Ram@21:1/5 to Stefan Ram on Thu Oct 27 10:19:34 2022
    ram@zedat.fu-berlin.de (Stefan Ram) writes:
    Many such operations are provided by the standard library
    of C++. You could have a look at its implementation. One might
    even think of kinda "backporting" it to C. Or use C++.

    One could also look at the implementation of strings in
    Python. Python already is a library that can be used
    from C. So, one could use Python in C as a library just
    for its data types or just for string handling.

  • From James Harris@21:1/5 to Dmitry A. Kazakov on Sat Oct 29 12:23:10 2022
    On 27/10/2022 08:28, Dmitry A. Kazakov wrote:
    On 2022-10-26 21:43, James Harris wrote:
    On 24/10/2022 13:07, Dmitry A. Kazakov wrote:
    On 2022-10-24 12:31, James Harris wrote:
    Do you guys have any thoughts on the best ways for strings of
    characters to be stored?

    1. There's the C way, of course, of reserving one value (zero) and
    using it as a terminator.

    2. There's the 'length prefix' option of putting the length of the
    string in a machine word before the characters.

    3. There's the 'double pointer' way of pointing at, say, first and
    past (where 'past' means first plus length such that the second
    pointer points one position beyond the last character).

    Any others?

    4. String body only. The constraints are known outside.

    This is the way string slices and fixed-length strings are
    implemented. In the latter case the compiler knows the string's bounds
    (first and last indices and thus the length). In the former case the
    compiler passes a "string dope" along with the naked body. The dope
    contains the bounds.

    That doesn't seem meaningfully different from case 3. To be clear,
    case 3 would be represented by, in addition to the bytes of the string,

    struct
       first: pointer to first byte of string
       past:  pointer to byte after last byte of string
       .... other fields ....
    end struct

    The string length would be past - first. The bytes of the string would
    be those pointed at (which I presume is what you are calling the naked
    body).

    That is the structure of a string dope, not the string itself, unless
    you have the body in other fields, but then why would you need pointers?

    Curious use of terms. I presume that by "dope" you mean a dope vector
    which can also be called a control block or a descriptor.

    As for this specific case, the same information can be conveyed in
    different ways: (start, length), (start, memsize), (first, last),
    (first, past). I chose the latter as it should be slightly faster than
    the others and does not run into problems when the elements are other
    than single bytes.

    To explain, for common operations,

    memsize() = past - first
    length() = memsize() >> alignbits
    forward iteration proceeds while address < past
    backward iteration proceeds while address >= first

    The only stipulation is that the body must not be allocated at the very
    top or bottom of the addressable range.

    Using (first, past) should be as simple as that. By contrast, the
    similar (first, last) runs into a slight problem when elements are wider
    than single bytes: should the last pointer point to the start or the end
    of the last item?

    The others, which involve memsize or length, make it slightly slower to
    judge the limits of iteration in the general case, requiring a
    calculation to see if a pointer is outside the limits of the string
    being referred to.
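
    A minimal C sketch of the (first, past) operations listed above. The
    type name Span32 and the function names are hypothetical, and 32-bit
    elements are assumed only to show the wider-than-a-byte case. Note that
    C does not allow a pointer below first to be formed, so the backward
    loop here tests before stepping down rather than testing
    address >= first afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor: field names follow the post, layout is assumed. */
typedef struct {
    const uint32_t *first; /* first element of the body */
    const uint32_t *past;  /* one position beyond the last element */
} Span32;

/* Size of the body in bytes: past - first, measured in bytes. */
static size_t span_memsize(Span32 s)
{
    assert(s.first <= s.past);
    return (size_t)((const char *)s.past - (const char *)s.first);
}

/* Number of elements. C pointer subtraction already divides by the
   element size, which is the job of the shift by alignbits in the post. */
static size_t span_length(Span32 s)
{
    assert(s.first <= s.past);
    return (size_t)(s.past - s.first);
}

/* Forward iteration proceeds while the address is below past. */
static uint64_t span_sum_forward(Span32 s)
{
    uint64_t sum = 0;
    for (const uint32_t *p = s.first; p < s.past; ++p)
        sum += *p;
    return sum;
}

/* Backward iteration: start at past and step down before each read, so
   no pointer below first is ever formed. Returns 0 if nothing is found. */
static uint32_t span_last_nonzero(Span32 s)
{
    for (const uint32_t *p = s.past; p > s.first; ) {
        --p;
        if (*p != 0)
            return *p;
    }
    return 0;
}
```

    A tail or mid slice is then just another Span32 whose first and past
    point into the same body; no elements are copied.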


    To clarify terms. String representation must include the string body if
    we are talking about values of strings. Things like pointers and
    vectorized dopes are references to a string, not strings. You can pass a
    string by a reference, sure. But the string value is somewhere else.
    What you pass is not a string; it is a substitute.

    That depends, surely, on how "a string" is defined. If strings are
    defined as descriptors starting with the fields first and past then the
    bodies of such strings can be elsewhere. (There would be other fields of
    a string descriptor to assist with memory management and probably some
    flags, though I am open to suggestions as to what those fields should be.)


    This has an effect on pointers. E.g. if you want slices and efficient
    raw strings you must distinguish pointers to definite (constrained)
    vs. indefinite (unconstrained) objects of same type.

    E.g. in Ada you cannot take an indefinite string pointer to a fixed
    length string because there are no bounds. If you wanted that feature
    you would use a "fat pointer" to carry bounds with it.

    Any reason you'd recommend against storing bounds as in the struct,
    above?

    Start with interoperability of strings and slices of. The crucial
    requirements would be:

       A slice can be passed to a subprogram expecting a string without
       copying.

    Indeed, that's a major benefit of slices, IMO, being able to pass
    something which looks and acts like a string but which doesn't need the
    elements of the string to be copied.

    That said, a slice would probably have a length which the callee can
    determine but which the callee cannot change. I presume that's what
    you'd call a constraint.

    If a callee wanted to be able to change the length of a string then it
    would have to be passed a real string, not a slice.

    I guess there would be these kinds of string argument:

    1. Read-write string. Anything could be done to the string by the
    callee. (Would have to be a real string.)

    2. Read-write fixed-length string. The string's contents could be
    altered but it could not be made longer or shorter. (Could be a real
    string or a slice.)

    3. Read-only string. Neither its length nor its contents could be altered
    by the callee. (Could be a real string or a slice.)


    Consider efficiency and low-level close to hardware stuff:

       Aggregation of strings with known bounds does not require storing them.

    E.g. you can have arrays of fixed length strings (like an image buffer).
    If a member of a structure is a fixed length string, no bounds are
    stored. A pointer to a fixed length string is a plain pointer etc.
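
    A small C sketch of that idea, assuming an 8-character field. The names
    NAME_LEN, Record and set_name are hypothetical. The bound lives only in
    the type, so the object holds the body and nothing else, and a pointer
    to the field is a plain pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { NAME_LEN = 8 };      /* bound known to the compiler, never stored */

typedef struct {
    char name[NAME_LEN];    /* body only: no length word, no terminator */
    int  id;
} Record;

/* Copy text into a fixed-length field, truncating or padding with spaces.
   The parameter type carries the bound, so no length is passed at run time. */
static void set_name(char (*field)[NAME_LEN], const char *text)
{
    assert(field != NULL && text != NULL);
    size_t n = strlen(text);
    if (n > NAME_LEN)
        n = NAME_LEN;                       /* truncate to the fixed length */
    memcpy(*field, text, n);
    memset(*field + n, ' ', NAME_LEN - n);  /* pad the remainder */
}
```

    An array such as char grid[4][NAME_LEN] then occupies exactly
    4 * NAME_LEN bytes, which is the image-buffer case mentioned above.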

    You mean the string bounds could be known at compile time, say, rather
    than at run time. Good point. Any suggestions on how that should be
    implemented?



    This is similar to atomic, volatile objects and pointers to them. The
    mechanics are the same. You cannot take a general-purpose pointer to an
    atomic object, because the client code would not know that it should
    take care upon dereferencing.

    I am not sure what that means. I guess the point you are making is
    that there are levels of classification which don't affect the data
    type but they do affect how it can be accessed - with the language
    needing to prevent a reference weakening the storage model. For
    example, a read-write reference to a substring should be prevented
    from being used to access part of a string which is supposed to be
    read-only.

    Yes, it is a type constraint. There are all sorts of constraints one
    could put on a type in order to produce a constrained subtype.
    Constraining limits operations, e.g. immutability removes mutators. It
    also directs certain implementations like using locking instructions or
    dropping known bounds.


    Was with you all the way until you mentioned dropping known bounds. What
    does that mean? How can it be legitimate to drop any bounds?


    --
    James Harris

  • From Bart@21:1/5 to James Harris on Sat Oct 29 13:24:45 2022
    On 29/10/2022 12:23, James Harris wrote:
    On 27/10/2022 08:28, Dmitry A. Kazakov wrote:
    On 2022-10-26 21:43, James Harris wrote:
    On 24/10/2022 13:07, Dmitry A. Kazakov wrote:
    On 2022-10-24 12:31, James Harris wrote:
    Do you guys have any thoughts on the best ways for strings of
    characters to be stored?

    1. There's the C way, of course, of reserving one value (zero) and
    using it as a terminator.

    2. There's the 'length prefix' option of putting the length of the
    string in a machine word before the characters.

    3. There's the 'double pointer' way of pointing at, say, first and
    past (where 'past' means first plus length such that the second
    pointer points one position beyond the last character).

    Any others?

    4. String body only. The constraints are known outside.

    This is the way string slices and fixed length strings are
    implemented. In the latter case the compiler knows the string's bounds
    (first and last indices and thus the length). In the former case the
    compiler passes a "string dope" along with the naked body. The dope
    contains the bounds.

    That doesn't seem meaningfully different from case 3. To be clear,
    case 3 would be represented by, in addition to the bytes of the string,

    struct
       first: pointer to first byte of string
       past:  pointer to byte after last byte of string
       .... other fields ....
    end struct

    The string length would be past - first. The bytes of the string
    would be those pointed at (which I presume is what you are calling
    the naked body).

    That is the structure of a string dope, not the string itself, unless
    you have the body in other fields, but then why would you need pointers?

    Curious use of terms. I presume that by "dope" you mean a dope vector
    which can also be called a control block or a descriptor.

    As for this specific case, the same information can be conveyed in
    different ways: (start, length), (start, memsize), (first, last),
    (first, past). I chose the latter as it should be slightly faster than
    the others and does not run into problems when the elements are other
    than single bytes.

    To explain, for common operations,

      memsize() = past - first
      length()  = memsize() >> alignbits
      forward iteration proceeds while address < past
      backward iteration proceeds while address >= first

    The only stipulation is that the body must not be allocated at the very
    top or bottom of the addressable range.

    Using (first, past) should be as simple as that. By contrast, the
    similar (first, last) runs into a slight problem when elements are wider
    than single bytes: should the last pointer point to the start or the end
    of the last item?

    The others, which involve memsize or length, make it slightly slower to
    judge the limits of iteration in the general case, requiring a
    calculation to see if a pointer is outside the limits of the string
    being referred to.


    To clarify terms. String representation must include the string body
    if we are talking about values of strings. The things like pointers
    and vectorized dopes are references to a string, not strings. You can
    pass a string by a reference, sure. But the string value is somewhere
    else. What you pass is not a string; it is a substitute.

    That depends, surely, on how "a string" is defined. If strings are
    defined as descriptors starting with the fields first and past then the
    bodies of such strings can be elsewhere. (There would be other fields of
    a string descriptor to assist with memory management and probably some
    flags, though I am open to suggestions as to what those fields should be.)


    This has an effect on pointers. E.g. if you want slices and
    efficient raw strings you must distinguish pointers to definite
    (constrained) vs. indefinite (unconstrained) objects of same type.

    E.g. in Ada you cannot take an indefinite string pointer to a fixed
    length string because there are no bounds. If you wanted that feature
    you would use a "fat pointer" to carry bounds with it.

    Any reason you'd recommend against storing bounds as in the struct,
    above?

    Start with interoperability of strings and slices of. The crucial
    requirements would be:

        A slice can be passed to a subprogram expecting a string without
        copying.

    Indeed, that's a major benefit of slices, IMO, being able to pass
    something which looks and acts like a string but which doesn't need the
    elements of the string to be copied.

    That said, a slice would probably have a length which the callee can
    determine but which the callee cannot change. I presume that's what
    you'd call a constraint.

    If a callee wanted to be able to change the length of a string then it
    would have to be passed a real string, not a slice.

    I guess there would be these kinds of string argument:

    1. Read-write string. Anything could be done to the string by the
    callee. (Would have to be a real string.)

    2. Read-write fixed-length string. The string's contents could be
    altered but it could not be made longer or shorter. (Could be a real
    string or a slice.)

    3. Read-only string. Neither its length nor its contents could be altered
    by the callee. (Could be a real string or a slice.)

    4. Extensible string. This is not quite the same as your (1) which
    requires only a mutable string.

    You can mutate a string (alter individual characters) without needing to
    know the overall length or its allocated capacity.

    (You might further split that into mutable/non-mutable extensible
    strings. Usually if growing a string by appending to it, you don't want
    to also alter existing parts of the string.)

    (You probably need to consider Unicode strings too, especially if
    represented as UTF8, as the meaning of 'length' needs pinning down.)
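
    To make the two candidate meanings of 'length' concrete, here is a
    short C sketch that counts code points in a UTF-8 body, to be compared
    with its size in bytes. The function name is hypothetical and the input
    is assumed to be well-formed UTF-8; no validation is attempted.

```c
#include <assert.h>
#include <stddef.h>

/* Count code points in a UTF-8 byte sequence assumed to be well formed.
   Every byte that is not a continuation byte (bit pattern 10xxxxxx)
   starts a new code point. */
static size_t utf8_code_points(const unsigned char *bytes, size_t nbytes)
{
    assert(bytes != NULL || nbytes == 0);
    size_t count = 0;
    for (size_t i = 0; i < nbytes; ++i)
        if ((bytes[i] & 0xC0) != 0x80)
            ++count;
    return count;
}
```

    A body holding the bytes 'n', 'a', 0xC3, 0xAF, 'v', 'e' is 6 bytes long
    but contains 5 code points, so a descriptor's length field has to say
    which of the two it records.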

  • From James Harris@21:1/5 to Stefan Ram on Sat Oct 29 14:10:49 2022
    On 27/10/2022 09:58, Stefan Ram wrote:
    James Harris <james.harris.1@gmail.com> writes:
    Potential operations on string structures:
    * allocate a new string
    * create a slice (view) of an existing string
    * index into a string

    Many of such operations are provided by the standard library
    of C++. You could have a look at its implementation. One might
    even think of kinda "backporting" it to C. Or use C++.

    Suggested Video: "The strange details of std::string at
    Facebook" - Nicholas Ormrod (2016)

    Thanks. I had a look at that video - and a number of others.


    --
    James Harris

  • From James Harris@21:1/5 to Bart on Sat Oct 29 15:01:34 2022
    On 27/10/2022 12:14, Bart wrote:
    On 26/10/2022 21:33, James Harris wrote:
    On 24/10/2022 15:28, Bart wrote:
    On 24/10/2022 11:31, James Harris wrote:

    Do you guys have any thoughts on the best ways for strings of
    characters to be stored?

    ..

    String slices can point into another string (allowing sharing), or
    into another slice, or into a regular zero-terminated string.

    That's more universal and therefore perhaps the best to implement if
    only one scheme is to be available.

    Most strings are fixed-length once created; strings that can grow are
    rare. You don't need a 'capacity' field for example (like C++'s
    std::vector type).

    Having watched some videos on string storage recently I now think I know
    what you mean by the capacity field - basically that a string descriptor
    would consist of these fields:

    start
    length
    capacity

    so that the string could be extended at the end (up to the capacity).
    That may be a bit restrictive. A programmer might want to remove or add
    characters at the beginning rather than just at the end, even though
    such would be done less often.

    So what do you think of having a string descriptor more like

    first
    past
    memfirst
    mempast

    where memfirst and mempast would define the allocated space in which the
    string body would sit.
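
    A rough C sketch of how such a four-pointer descriptor could allow
    growth at either end. The type and function names are hypothetical, the
    buffer is supplied by the caller, and starting the string in the middle
    of the allocation is just one possible policy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    char *first;     /* first character of the string */
    char *past;      /* one beyond the last character */
    char *memfirst;  /* start of the allocated space */
    char *mempast;   /* one beyond the allocated space */
} Str;

/* Start with an empty string in the middle of a caller-supplied buffer,
   leaving room to grow at both ends. */
static void str_init(Str *s, char *buf, size_t cap)
{
    assert(s != NULL && buf != NULL);
    s->memfirst = buf;
    s->mempast  = buf + cap;
    s->first = s->past = buf + cap / 2;
}

/* Extend at the end if the room between past and mempast allows it. */
static bool str_append(Str *s, const char *src, size_t n)
{
    if ((size_t)(s->mempast - s->past) < n)
        return false;
    memcpy(s->past, src, n);
    s->past += n;
    return true;
}

/* Extend at the front if the room between memfirst and first allows it. */
static bool str_prepend(Str *s, const char *src, size_t n)
{
    if ((size_t)(s->first - s->memfirst) < n)
        return false;
    s->first -= n;
    memcpy(s->first, src, n);
    return true;
}
```

    Removing characters from either end is then only a matter of moving
    first or past inwards; a real implementation would reallocate or shift
    the body when the room at one end runs out.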

    Or perhaps the descriptor should be unified with other references to
    memory - as I understand is true of your example, below.


    But managing memory can still be an issue because you don't know if a
    particular slice owns its memory, or points to a string literal, or
    points into a shared string, or points to external memory.

    Yes, some flags would be needed. They could be stored in the low-order
    bits of memfirst and mempast given that:

    a) allocations (hence, memfirst) could be aligned
    b) sizes of allocations (hence, mempast) could be rounded up to a
    suitable power of 2
    c) memfirst and mempast would need to be used far less often than first
    and past so there would be no great problem with the cost of masking out
    the low-order bits to get the addresses.
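
    As an illustration of points (a) to (c), a C sketch of packing flags
    into the low-order bits of an aligned address. The flag names are
    hypothetical and 8-byte alignment is assumed. Round-tripping a pointer
    through uintptr_t like this is implementation-defined in C, although it
    behaves as expected on mainstream platforms.

```c
#include <assert.h>
#include <stdint.h>

enum {
    FLAG_OWNS_MEMORY = 1u << 0,  /* descriptor owns the allocation */
    FLAG_LITERAL     = 1u << 1,  /* body is a string literal */
    FLAG_SHARED      = 1u << 2,  /* body is shared with other strings */
    FLAG_MASK        = 7u        /* assumes allocations are 8-byte aligned */
};

/* Combine an aligned address and a set of flags into one word. */
static uintptr_t tag_pack(void *addr, unsigned flags)
{
    uintptr_t bits = (uintptr_t)addr;
    assert((bits & FLAG_MASK) == 0);              /* alignment assumption */
    assert((flags & ~(unsigned)FLAG_MASK) == 0);  /* only known flags */
    return bits | flags;
}

/* Mask out the flag bits to recover the address. */
static void *tag_address(uintptr_t word)
{
    return (void *)(word & ~(uintptr_t)FLAG_MASK);
}

/* Extract the flag bits. */
static unsigned tag_flags(uintptr_t word)
{
    return (unsigned)(word & FLAG_MASK);
}
```

    The cost is one AND per access to memfirst or mempast, which matches
    the expectation that those fields are used far less often than first
    and past.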

    ..

    Within my dynamic scripting language, I have a full-on counted string
    type, with reference counting to manage sharing and allow automatic
    memory management.

    What fields did you use to manage such stuff? Am I on the right lines
    with the ideas above?

    The structure I use is not lightweight because it is for interpreted
    code. The following object descriptor is a 32-byte record, used for all
    objects. I've shown only the fields used by string objects:

        record objrec =
            u32         refcount
            byte        mutable      # 1 for mutable strings
            byte        objtype
            u16         dummy

            ichar       strptr       # (ref char)
            u64         length
            union
                u64     alloc64
                object  objptr2      # (ref objptr)
            end
        end

    That looks very sensible. I have considered having 'sentient references'
    which would have a common format for anything which refers to memory
    (especially referents of dynamic size) and would include the address of
    a vtable for the specific type of sentient reference. The vtable would
    hold the addresses of methods which could be applied to the reference
    rather than to the referent. IOW the referent and the reference would
    each have a type.

    ATM I think I'd need to work through a lot more use cases before I would
    be ready to settle on the details of that so for now I may just go with
    the idea of a string descriptor.


    The string data itself is separate, pointed to by 'strptr'. This is nil
    when the length is zero (it doesn't point to ""). It is not
    zero-terminated (unless an external slice happens to be).

    Most strings are mutable, then .alloc64 gives the capacity of the
    allocation.

    An important field is objtype; its values are:

        Normal            Regular string (uses alloc64)
        Slice             Slice into another (uses objptr2)
        Extslice          Strings lie outside the object scheme
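
    For readers more used to C, the record above might be transcribed
    roughly as follows. This is a guess at the layout from the fields shown,
    not the actual implementation: the C names are hypothetical, a 64-bit
    target is assumed for the 32-byte size, and the refcount handling in
    obj_slice is an assumption about how a slice keeps its parent alive.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum ObjType { OBJ_NORMAL, OBJ_SLICE, OBJ_EXTSLICE };

typedef struct ObjRec {
    uint32_t refcount;
    uint8_t  is_mutable;   /* 1 for mutable strings */
    uint8_t  objtype;      /* one of enum ObjType */
    uint16_t dummy;        /* padding up to 8 bytes */
    char    *strptr;       /* string body; NULL when length is 0 */
    uint64_t length;
    union {
        uint64_t       alloc64;  /* OBJ_NORMAL: capacity of the allocation */
        struct ObjRec *objptr2;  /* OBJ_SLICE: the object sliced into */
    } u;
} ObjRec;

/* Make `out` a slice of `parent` covering [start, start + len).
   The body is shared, not copied. */
static void obj_slice(ObjRec *out, ObjRec *parent, uint64_t start, uint64_t len)
{
    assert(out != NULL && parent != NULL);
    assert(start <= parent->length && len <= parent->length - start);
    out->refcount   = 1;
    out->is_mutable = parent->is_mutable;
    out->objtype    = OBJ_SLICE;
    out->dummy      = 0;
    out->strptr     = len ? parent->strptr + start : NULL;
    out->length     = len;
    out->u.objptr2  = parent;
    parent->refcount++;    /* assumed: the slice holds a reference */
}
```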

    OK. I may use something like that or, possibly, some flags.

    ..

    Note that if you take those 32 bytes, then the middle 16 bytes (.strptr
    and .length fields) correspond to a raw Slice as used in my lower level
    language.

    Good point. I'd need slices to have the same format as strings and for
    both to have flags. As there's no space for flags in the (first, past)
    pair I'd need to add a flags word, making the structure

    first
    past
    misc
    memfirst
    mempast

    where misc would store various pieces of information, not just flag
    bits. Slices would have only the first three fields. Strings would have
    all five. Flags would indicate whether this was a string or a slice.

    For me it's too early to optimise but it's worth noting that even for
    64-bit machines the above would occupy only 24 or 40 bytes of a 64-byte
    cache line so short string bodies could be stored in the same line,
    again with flags indicating that that was so.


    --
    James Harris

  • From James Harris@21:1/5 to Bart on Sat Oct 29 15:16:27 2022
    On 29/10/2022 13:24, Bart wrote:
    On 29/10/2022 12:23, James Harris wrote:


    I guess there would be these kinds of string argument:

    1. Read-write string. Anything could be done to the string by the
    callee. (Would have to be a real string.)

    2. Read-write fixed-length string. The string's contents could be
    altered but it could not be made longer or shorter. (Could be a real
    string or a slice.)

    3. Read-only string. Neither its length nor its contents could be
    altered by the callee. (Could be a real string or a slice.)

    4. Extensible string. This is not quite the same as your (1) which
    requires only a mutable string.

    You mean a string which can be made longer but the existing contents
    could not be changed? I cannot think of a use case for that.


    You can mutate a string (alter individual characters) without needing to
    know the overall length or its allocated capacity.

    Wouldn't you need to know how long the string was so that a callee could
    make sure it was trying to modify characters within the string rather
    than memory locations outside it?


    (You might further split that into mutable/non-mutable extensible
    strings. Usually if growing a string by appending to it, you don't want
    to also alter existing parts of the string.)

    Mutable and extensible are good descriptions though as above I don't yet
    see the value in allowing a string to be extensible but its existing
    contents to be immutable.

    A slice would be inextensible but could be mutable or immutable, AISI.


    (You probably need to consider Unicode strings too, especially if
    represented as UTF8, as the meaning of 'length' needs pinning down.)

    I haven't mentioned it but ATM my chars are 32-bit and any 32-bit value
    can be stored in them, including zero. It also means there's no way to
    reserve a value for EOF so that condition has to be handled a different
    way from what C programmers are used to where EOF is a value which is
    outside the range permitted for chars. Challenges aplenty!


    --
    James Harris

  • From Bart@21:1/5 to James Harris on Sat Oct 29 16:30:13 2022
    On 29/10/2022 15:16, James Harris wrote:
    On 29/10/2022 13:24, Bart wrote:
    On 29/10/2022 12:23, James Harris wrote:


    I guess there would be these kinds of string argument:

    1. Read-write string. Anything could be done to the string by the
    callee. (Would have to be a real string.)

    2. Read-write fixed-length string. The string's contents could be
    altered but it could not be made longer or shorter. (Could be a real
    string or a slice.)

    3. Read-only string. Neither its length nor its contents could be
    altered by the callee. (Could be a real string or a slice.)

    4. Extensible string. This is not quite the same as your (1) which
    requires only a mutable string.

    You mean a string which can be made longer but the existing contents
    could not be changed? I cannot think of a use case for that.

    That's a pattern I used all the time to incrementally build strings, for
    example to generate C or ASM source files from a language app.

    Or it can be as simple as this:

        errormess +:= " on line "+tostr(linenumber)

    Once extended, the existing parts of the string are never modified.
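
    The append-only pattern can be sketched in C as a small builder whose
    interface offers appending and nothing else. The Builder type and its
    functions are hypothetical; growth by doubling is an assumption, and
    overflow of the size arithmetic is not handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char  *data;      /* body, not zero-terminated */
    size_t length;    /* bytes in use */
    size_t capacity;  /* bytes allocated */
} Builder;            /* zero-initialise before first use */

/* Append n bytes, growing the allocation geometrically when needed.
   Existing contents are never altered, only possibly moved by realloc. */
static bool builder_append(Builder *b, const char *src, size_t n)
{
    assert(b != NULL && (src != NULL || n == 0));
    if (n == 0)
        return true;
    if (n > b->capacity - b->length) {
        size_t need = b->length + n;
        size_t cap = b->capacity ? b->capacity : 16;
        while (cap < need)
            cap *= 2;
        char *p = realloc(b->data, cap);
        if (p == NULL)
            return false;
        b->data = p;
        b->capacity = cap;
    }
    memcpy(b->data + b->length, src, n);
    b->length += n;
    return true;
}

/* Append a zero-terminated C string (the terminator is not stored). */
static bool builder_append_str(Builder *b, const char *s)
{
    return builder_append(b, s, strlen(s));
}

/* Append the decimal text of an integer, like tostr() in the example. */
static bool builder_append_int(Builder *b, int value)
{
    char digits[32];
    int n = snprintf(digits, sizeof digits, "%d", value);
    return n > 0 && builder_append(b, digits, (size_t)n);
}
```

    With a Builder named errormess, the one-liner above becomes
    builder_append_str(&errormess, " on line ") followed by
    builder_append_int(&errormess, linenumber).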

    Perhaps you can give an example of where mutating the characters of a
    string, extensible or otherwise, comes in useful.

    (My strings generally are mutable, but it's not a feature I use a great
    deal.

    For applications like text editors, I use a list of strings, one per
    line. And editing within each line creates a new string for each edit.
    Efficiency here is not critical, and the needs are diverse, like
    deleting within the string, or insertion. It's just easier to construct
    a new one.)



    You can mutate a string (alter individual characters) without needing
    to know the overall length or its allocated capacity.

    Wouldn't you need to know how long the string was so that a callee could
    make sure it was trying to modify characters within the string rather
    than memory locations outside it?

    My point is that, given only the string pointer and an index or offset
    into it, that's all that's needed to modify it. If slices at least are
    used, then the callee could do bounds checking /if it wanted/.

    (My dynamic language does do runtime checking of bounds but, once an
    application has been developed, it is very, very rare that I have a
    bounds error come up. In a working, debugged program, it should not be
    necessary.)


    (You might further split that into mutable/non-mutable extensible
    strings. Usually if growing a string by appending to it, you don't
    want to also alter existing parts of the string.)

    Mutable and extensible are good descriptions though as above I don't yet
    see the value in allowing a string to be extensible but its existing
    contents to be immutable.

    A slice would be inextensible but could be mutable or immutable, AISI.


    (You probably need to consider Unicode strings too, especially if
    represented as UTF8, as the meaning of 'length' needs pinning down.)

    I haven't mentioned it but ATM my chars are 32-bit and any 32-bit value
    can be stored in them, including zero. It also means there's no way to
    reserve a value for EOF so that condition has to be handled a different
    way from what C programmers are used to where EOF is a value which is
    outside the range permitted for chars. Challenges aplenty!

    But you're not using all 2**32 bit patterns? It could reserve -1 or all
    1s for EOF just like C does. Because EOF would generally be used for
    character-at-a-time streaming, which is typically 8-bit anyway.

    Or have you developed a binary file system which works with 32-bit-wide
    'bytes'?

  • From James Harris@21:1/5 to Bart on Sat Oct 29 18:42:51 2022
    On 29/10/2022 16:30, Bart wrote:
    On 29/10/2022 15:16, James Harris wrote:
    On 29/10/2022 13:24, Bart wrote:
    On 29/10/2022 12:23, James Harris wrote:


    I guess there would be these kinds of string argument:

    1. Read-write string. Anything could be done to the string by the
    callee. (Would have to be a real string.)

    2. Read-write fixed-length string. The string's contents could be
    altered but it could not be made longer or shorter. (Could be a real
    string or a slice.)

    3. Read-only string. Neither its length nor its contents could be
    altered by the callee. (Could be a real string or a slice.)

    4. Extensible string. This is not quite the same as your (1) which
    requires only a mutable string.

    You mean a string which can be made longer but the existing contents
    could not be changed? I cannot think of a use case for that.

    That's a pattern I used all the time to incrementally build strings, for
    example to generate C or ASM source files from a language app.

    Or it can be as simple as this:

        errormess +:= " on line "+tostr(linenumber)

    Once extended, the existing parts of the string are never modified.

    Good examples. The 'extend' permission seems a bit specific although I
    accept that the uses you mention are common. I suppose it adds to the
    security of the language to be able to designate a string as
    extensible/inextensible separately from designating whether its existing
    contents can be changed or not.

    How would it be used? Thinking about functions which take a string as
    input, most strings would be purely inputs. They would therefore be both
    read-only and inextensible within the called function. Such arguments
    could be strings or slices.

    Further, functions which /return/ a string would create the string and
    return it whole.

    It is only functions which /modify/ a string, i.e. take it as an inout
    parameter, where it would matter whether the string was read/write or
    extensible. For an inout string what should be the defaults? If we say
    an inout string defaults to immutable and inextensible then that would
    lead to the following ways to specify a string, s, as a parameter:


    f: function(s: inout string char)
    f: function(s: inout string char rw)
    f: function(s: inout string char ext rw)
    f: function(s: inout string char ext)

    Note the "ext" and "rw" attributes. The idea is that they would specify
    how the string could be modified in the function. Adding rw would allow
    the string's existing contents to be taken as read-write rather than
    read-only. Adding ext would allow the string to be extended.
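
    One way to picture the rw and ext attributes is as permission bits
    carried with the parameter and checked at run time. The C sketch below
    does only that; it does not show static enforcement by a compiler, and
    the names StrParam, PERM_RW and PERM_EXT are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { PERM_RW = 1, PERM_EXT = 2 };

typedef struct {
    uint32_t *first;    /* first element */
    uint32_t *past;     /* one beyond the last element */
    uint32_t *mempast;  /* end of the room to grow into; == past for a slice */
    unsigned  perms;    /* combination of PERM_RW and PERM_EXT */
} StrParam;

/* Overwrite an existing element: needs rw and an index within bounds. */
static bool param_set(StrParam *s, size_t index, uint32_t ch)
{
    assert(s != NULL);
    if (!(s->perms & PERM_RW))
        return false;
    if (index >= (size_t)(s->past - s->first))
        return false;
    s->first[index] = ch;
    return true;
}

/* Append one element: needs ext and spare room after past. */
static bool param_push(StrParam *s, uint32_t ch)
{
    assert(s != NULL);
    if (!(s->perms & PERM_EXT))
        return false;
    if (s->past == s->mempast)
        return false;
    *s->past++ = ch;
    return true;
}
```

    A parameter with neither bit set can be satisfied by a string or a
    slice, and a slice can never satisfy ext because its mempast equals its
    past.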

    That's effectively me thinking out loud and trying out some ideas. How
    does it look to you?

    What about other permissions such as prepend, split, insert, delete,
    etc? Perhaps it's too specific to have too many qualifiers although I
    can see value in using such info to help match caller and callee. For
    example, given the above one could say that as long as the callee
    doesn't specify the string as ext then it could be either a string or a
    slice. That is appealing from a security perspective.

    That said, can a compiler ensure that a string is not used in a way
    which breaks the contract indicated by its keywords? You raise some big
    issues!


    Perhaps you can give an example of where mutating the characters of a
    string, extensible or otherwise, comes in useful.

    I intend a string to be simply an array whose length can be changed. The
    idea being that a program could have a string of integers, a string of
    floats etc just as easily as having a string of characters. As such,
    anything which changes the content of an array should also work on
    strings. For example, one might want to sort an array in place. As a
    string of characters one might want to convert lower case to upper case,
    etc.


    (My strings generally are mutable, but it's not a feature I use a great
    deal.

    For applications like text editors, I use a list of strings, one per
    line. And editing within each line creates a new string for each edit.
    Efficiency here is not critical, and the needs are diverse, like
    deleting within the string, or insertion. It's just easier to construct
    a new one.)

    OK.

    ..


    (You might further split that into mutable/non-mutable extensible
    strings. Usually if growing a string by appending to it, you don't
    want to also alter existing parts of the string.)

    Mutable and extensible are good descriptions though as above I don't
    yet see the value in allowing a string to be extensible but its
    existing contents to be immutable.

    A slice would be inextensible but could be mutable or immutable, AISI.


    (You probably need to consider Unicode strings too, especially if
    represented as UTF8, as the meaning of 'length' needs pinning down.)

    I haven't mentioned it but ATM my chars are 32-bit and any 32-bit
    value can be stored in them, including zero. It also means there's no
    way to reserve a value for EOF so that condition has to be handled a
    different way from what C programmers are used to where EOF is a value
    which is outside the range permitted for chars. Challenges aplenty!

    But you're not using all 2**32 bit patterns? It could reserve -1 or all
    1s for EOF just like C does. Because EOF would generally be used for
    character-at-a-time streaming, which is typically 8-bit anyway.

    As above, the language is meant to treat strings as arrays. So AISI it
    should not ascribe any particular meaning to their contents.

    There are other ways. For example, my plan for EOF is twofold:

    1. to have it as an attribute of a file object

    2. to have an attempt to read at EOF throw a weak exception which would
    be a catchable way to end an iteration.
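
    A C sketch of the first idea, with EOF kept out of band so that every
    32-bit value remains a valid char. C has no exceptions, so the second
    idea is approximated here by a boolean return that ends the loop. The
    Reader type is hypothetical and reads from memory rather than a file;
    octets are zero-extended to 32 bits as described below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const unsigned char *next;  /* next octet to read */
    const unsigned char *past;  /* one beyond the last octet */
    bool at_eof;                /* EOF is an attribute of the reader */
} Reader;

/* Read one octet, zero-extended to a 32-bit char. Returns false at end
   of input, leaving *out untouched and setting at_eof. */
static bool reader_get(Reader *r, uint32_t *out)
{
    assert(r != NULL && out != NULL);
    if (r->next == r->past) {
        r->at_eof = true;
        return false;
    }
    *out = *r->next++;
    return true;
}
```

    A typical loop is then while (reader_get(&r, &ch)) { ... }, and no char
    value needs to be reserved for EOF.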


    Or have you developed a binary file system which works with 32-bit-wide
    'bytes'?

    No, my system is nothing like that advanced. At present all bytes
    (octets) I read from disk are zero extended to 32 bits. And all chars I
    write to disk have their top 24 zero bits chopped off. Though please
    don't think that's by design. It's only a temporary measure while I get
    the compiler up and running properly. (The compiler and the compilable
    language are, at present, rather limited.) In the long term IO streams
    should be via typed channels where chars of octets (or some other size)
    could be handled natively.


    --
    James Harris

  • From Dmitry A. Kazakov@21:1/5 to James Harris on Sat Oct 29 22:17:27 2022
    On 2022-10-29 13:23, James Harris wrote:
    On 27/10/2022 08:28, Dmitry A. Kazakov wrote:
    On 2022-10-26 21:43, James Harris wrote:
    On 24/10/2022 13:07, Dmitry A. Kazakov wrote:
    On 2022-10-24 12:31, James Harris wrote:
    Do you guys have any thoughts on the best ways for strings of
    characters to be stored?

    1. There's the C way, of course, of reserving one value (zero) and
    using it as a terminator.

    2. There's the 'length prefix' option of putting the length of the
    string in a machine word before the characters.

    3. There's the 'double pointer' way of pointing at, say, first and
    past (where 'past' means first plus length such that the second
    pointer points one position beyond the last character).

    Any others?

    4. String body only. The constraints are known outside.

    This is the way string slices and fixed length strings are
    implemented. In the latter case the compiler knows the string's bounds
    (first and last indices and thus the length). In the former case the
    compiler passes a "string dope" along with the naked body. The dope
    contains the bounds.

    That doesn't seem meaningfully different from case 3. To be clear,
    case 3 would be represented by, in addition to the bytes of the string,

    struct
       first: pointer to first byte of string
       past:  pointer to byte after last byte of string
       .... other fields ....
    end struct

    The string length would be past - first. The bytes of the string
    would be those pointed at (which I presume is what you are calling
    the naked body).

    That is the structure of a string dope, not the string itself, unless
    you have the body in other fields, but then why would you need pointers?

    Curious use of terms. I presume that by "dope" you mean a dope vector
    which can also be called a control block or a descriptor.

    As for this specific case, the same information can be conveyed in
    different ways: (start, length), (start, memsize), (first, last),
    (first, past). I chose the latter as it should be slightly faster than
    the others and does not run into problems when the elements are other
    than single bytes.

    Yes. The problem with (first, next) is that next could be inexpressible.
    Most difficulties arise with strings/arrays over enumerations and
    modular types. (first, last) has no such problem.

    Both have issues with empty strings, e.g. with a multitude of
    representations of. Compare with +/-0 problem for non-2-complement integers.

    Using (first, past) should be as simple as that. By contrast, the
    similar (first, last) runs into a slight problem when elements are wider
    than single bytes: should the last pointer point to the start or the end
    of the last item?

    Just do not use multibyte representations at all. E.g. a UTF-8 string
    is represented by an array of *octets*. It has a view of an array of
    code points, but that is not the physical representation, only a view.
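
    A minimal sketch of that distinction, using Python as a neutral
    stand-in (nobody in the thread uses it for this). The stored
    representation is the octet array; code_points() is a hypothetical
    helper that presents a code-point view by decoding on the fly. It
    assumes well-formed UTF-8 and does no validation.

```python
# Physical representation: an array of octets (UTF-8 encoded).
body = "na\u00efve \u20ac".encode("utf-8")

def code_points(octets: bytes):
    """Yield code points decoded from the octets; no second array is stored."""
    i = 0
    while i < len(octets):
        lead = octets[i]
        if lead < 0x80:                 # 1-octet sequence (ASCII)
            width, value = 1, lead
        elif lead >> 5 == 0b110:        # 2-octet sequence
            width, value = 2, lead & 0x1F
        elif lead >> 4 == 0b1110:       # 3-octet sequence
            width, value = 3, lead & 0x0F
        else:                           # 4-octet sequence
            width, value = 4, lead & 0x07
        for b in octets[i + 1:i + width]:
            value = (value << 6) | (b & 0x3F)   # fold in continuation octets
        yield value
        i += width

print(len(body))                        # length of the physical array, in octets
print(len(list(code_points(body))))     # length seen through the code-point view
```

    The two lengths differ, which is the point: indexing by code point is
    a property of the view, not of the storage.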

    To clarify terms. String representation must include the string body
    if we are talking about values of strings. The things like pointers
    and vectorized dopes are references to a string, not strings. You can
    pass a string by a reference, sure. But the string value is somewhere
    else. What you pass is not a string it is a substitute.

    That depends, surely, on how "a string" is defined.

    def String is a sequence of characters.

    There are no other definitions. Little depends on that because there is
    no requirement to represent a string this way. You are completely free to
    choose any suitable representation.

    [...]

    Skipped description of a possible representation.

    This has an effect on pointers. E.g. if you want slices and
    efficient raw strings you must distinguish pointers to definite
    (constrained) vs. indefinite (unconstrained) objects of same type.

    E.g. in Ada you cannot take an indefinite string pointer to a fixed
    length string because there are no bounds. If you wanted that feature
    you would use a "fat pointer" to carry bounds with it.

    Any reason you'd recommend against storing bounds as in the struct,
    above?

    Start with interoperability of strings and slices of. The crucial
    requirements would be:

        A slice can be passed to a subprogram expecting a string without
    copying.

    Indeed, that's a major benefit of slices, IMO, being able to pass
    something which looks and acts like a string but which doesn't need
    the elements of the string to be copied.

    That said, a slice would probably have a length which the callee can
    determine but which the callee cannot change. I presume that's what
    you'd call a constraint.

    It could be a constraint for fixed length slices.

    If a callee wanted to be able to change the length of a string then it
    would have to be passed a real string, not a slice.

    A callee might pass a variable length slice, which, for example, can
    be enlarged or shortened. Many languages with dynamically allocated
    strings have this. You need to find some balance between the
    flexibility of pool-allocated strings and the efficiency of fixed
    length ones. If the language has a developed type system you can have
    both transparently interchangeable for the programmer. Note this is
    the same discussion as with numbers. Programmers want all of them with
    an ability to pass one for another.

    I guess there would be these kinds of string argument:

    1. Read-write string. Anything could be done to the string by the
    callee. (Would have to be a real string.)

    2. Read-write fixed-length string. The string's contents could be
    altered but it could not be made longer or shorter. (Could be a real
    string or a slice.)

    3. Read-only string. Neither its length nor its contents could be altered
    by the callee. (Could be a real string or a slice.)

    Think of it in terms of constraints. Immutability is a constraint. Fixed
    length is a constraint. Bounded length is a constraint. Non-sliding
    lower bound is a constraint. Non-sliding upper bound is a constraint.

    This should cover all spectrum. You can express all cases in terms of
    constraints.

    Consider efficiency and low-level close to hardware stuff:

        Aggregation of strings with known bounds does not require storing
    them.

    E.g. you can have arrays of fixed length strings (like an image
    buffer). If a member of a structure is a fixed length string, no
    bounds are stored. A pointer to a fixed length string is a plain
    pointer etc.

    You mean the string bounds could be known at compile time, say, rather
    than at run time. Good point. Any suggestions on how that should be
    implemented?

    As I said, you just have the string body and nothing else in the
    representation. Compare it to numbers. You can have indefinite length
    integers, but for many reasons programmers stick to constrained variants
    like -2**15..2**15-1.

    This is similar to atomic, volatile objects and pointers to. The
    mechanics is same. You cannot take a general-purpose pointer to an
    atomic object, because the client code would not know that it should
    take care upon dereferencing.

    I am not sure what that means. I guess the point you are making is
    that there are levels of classification which don't affect the data
    type but they do affect how it can be accessed - with the language
    needing to prevent a reference weakening the storage model. For
    example, a read-write reference to a substring should be prevented
    from being used to access part of a string which is supposed to be
    read-only.

    Yes, it is a type constraint. There are all sorts of constraints one
    could put on a type in order to produce a constrained subtype.
    Constraining limits operations, e.g. immutability removes mutators. It
    also directs certain implementations like using locking instructions
    or dropping known bounds.

    Was with you all the way until you mentioned dropping known bounds. What
    does that mean? How can it be legitimate to drop any bounds?

    If you can deduce bounds why would you keep them? Again, consider a
    16-bit integer. Do you keep -2**15 and 2**15-1 anywhere? You do not. The
    same should happen to fixed length or fixed bounds strings.
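
    A rough sketch of "body only" storage, in Python for illustration.
    FixedStringArray and its WIDTH of 8 are hypothetical names and values,
    not anything from the thread. The bound lives in the type (here a
    class attribute standing in for compile-time knowledge), so the
    aggregate holds string bodies back to back with no per-string length
    or pointer.

```python
class FixedStringArray:
    """Array of fixed-length strings stored as bodies only, back to back."""

    WIDTH = 8  # hypothetical fixed length in bytes, known from the type

    def __init__(self, count: int):
        # No descriptors: just count * WIDTH bytes of string bodies.
        self.body = bytearray(count * self.WIDTH)

    def put(self, index: int, text: bytes) -> None:
        if len(text) != self.WIDTH:
            raise ValueError("fixed-length string must be exactly WIDTH bytes")
        start = index * self.WIDTH
        self.body[start:start + self.WIDTH] = text

    def get(self, index: int) -> bytes:
        start = index * self.WIDTH      # bounds are computed, never loaded
        return bytes(self.body[start:start + self.WIDTH])

names = FixedStringArray(3)
names.put(1, b"ABCDEFGH")
print(names.get(1))
print(len(names.body))                  # 3 * WIDTH bytes in total
```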

    --
    Regards,
    Dmitry A. Kazakov
    http://www.dmitry-kazakov.de

  • From Bart@21:1/5 to James Harris on Sat Oct 29 22:02:48 2022
    On 29/10/2022 18:42, James Harris wrote:
    On 29/10/2022 16:30, Bart wrote:

    Further, functions which /return/ a string would create the string and
    return it whole.

    Not necessarily. My dynamic language can return a string which is a
    slice into another. (Slices are not exposed in this language; they are
    in the static one, where slices are distinct types.)

    Example:

        func trim(s) =
            if s.len<=2 then return "" fi
            return s[2..$-1]
        end

    This trims the first and last characters of a string. But here it returns a
    slice into the original string. If I wanted a fresh copy, I'd have to
    use copy() inside the function, or copy() (or a special kind of
    assignment) outside it.
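
    The slice-versus-copy behaviour can be imitated in Python with a
    memoryview over a bytearray. This is only an analogy to the language
    described above, not its implementation; trim_view is a hypothetical
    name.

```python
def trim_view(s: memoryview) -> memoryview:
    """Drop the first and last element; the result shares storage with s."""
    if len(s) <= 2:
        return s[0:0]                    # empty view
    return s[1:-1]

original = bytearray(b"[hello]")
view = trim_view(memoryview(original))   # a slice into the original
fresh = bytes(view)                      # an explicit copy

original[1] = ord("J")                   # change the underlying string

print(bytes(view))                       # the view observes the change
print(fresh)                             # the copy does not
```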



    It is only functions which /modify/ a string, i.e. take it as an inout
    parameter, where it would matter whether the string was read/write or
    extensible. For an inout string what should be the defaults? If we say
    an inout string defaults to immutable and inextensible then that would
    lead to the following ways to specify a string, s, as a parameter:


      f: function(s: inout string char)
      f: function(s: inout string char rw)
      f: function(s: inout string char ext rw)
      f: function(s: inout string char ext)

    Note the "ext" and "rw" attributes. The idea is that they would specify
    how the string could be modified in the function. Adding rw would allow
    the string's existing contents to be taken as read-write rather than
    read-only. Adding ext would allow the string to be extended.

    That's effectively me thinking out loud and trying out some ideas. How
    does it look to you?

    My preference is to keep it simple:

    (1) String parameters are immutable

    (2) String parameters are mutable (this means changing existing
    content but also allows extension, plus deletion etc - the
    works)

    (3) String parameters are assignable. This means that, in addition to
    (2), assigning to the parameter also replaces the caller's version


    Python allows only (2), when working with Lists, and only (1) when
    working with Strings (Strings are immutable, Lists are mutable)

    (3) Requires full reference parameters so won't work in Python.
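
    The Python behaviour referred to here can be checked directly. The
    function names below are made up for the demonstration.

```python
def mutate(items: list) -> None:
    items.append("x")        # case (2): in-place change, visible to the caller

def assign(items: list) -> None:
    items = ["replaced"]     # rebinds the local name only, so case (3) fails

def try_mutate_string(s: str) -> bool:
    try:
        s[0] = "X"           # case (1): str rejects item assignment
    except TypeError:
        return False
    return True

data = ["a", "b"]
mutate(data)
assign(data)
print(data)                        # the append is kept, the assignment is lost
print(try_mutate_string("abc"))    # False: the string could not be modified
```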

    My scripting language allows (2) and (3) on both lists and strings. (1)
    is only possible by a flag within the object that renders it immutable
    (for example, passing a literal "ABC").

    When I mentioned having extensibility as a different capability than
    mutation, it is because this could be done via different string types.

    There is in-place modification which changes the length of the object,
    and modification where the length is not changed; I think these could
    be useful, distinct attributes.

    Changing the length requires a reference to the /original/ descriptor
    where all the info is stored (heap pointer, length, capacity).

    But changing the contents without affecting the size either needs only
    the heap pointer, or can be done with a /copy/ of the descriptor; it
    does not need the original.

    (My first implementation of a string type, on a 16- then 32-bit machine,
    used a 16-byte descriptor passed by value. The string was mutable, but
    it was not possible to extend it without a proper reference.)
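
    A small sketch of why a by-value descriptor can mutate but not extend.
    The Descriptor layout (heap, length, capacity) follows the fields
    named above; everything else, including the helper names and the use
    of a Python bytearray in place of a heap pointer, is an assumption
    made for illustration.

```python
from dataclasses import dataclass, replace

@dataclass
class Descriptor:
    heap: bytearray   # stands in for the heap pointer
    length: int
    capacity: int

def set_char(d: Descriptor, i: int, ch: int) -> None:
    d.heap[i] = ch                  # needs only the heap pointer

def append_char(d: Descriptor, ch: int) -> None:
    if d.length == d.capacity:
        raise MemoryError("sketch does not reallocate")
    d.heap[d.length] = ch
    d.length += 1                   # updates this descriptor only

original = Descriptor(heap=bytearray(b"abc\0\0"), length=3, capacity=5)
by_value = replace(original)        # shallow copy: same heap, separate length

set_char(by_value, 0, ord("X"))     # content change shows through the original
append_char(by_value, ord("d"))     # length change does not reach the original

print(bytes(original.heap[:original.length]))
print(bytes(by_value.heap[:by_value.length]))
```

    The caller's descriptor still reports length 3, so the appended
    character is stored but invisible to it. That is the reason an
    extension needs a reference to the original descriptor.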





    What about other permissions such as prepend, split, insert, delete,
    etc?

    These all count as in-place modifications (except split), but as I said
    above, it might be useful to treat length-modifying ones differently.

    It's not clear how 'split' works, but there are anyway all sorts of
    string ops that are not 'in-place'; they simply create new strings.
    Presumably 'split' creates 2 or more new strings.

    I intend a string to be simply an array whose length can be changed.

    I treat a string as one composite object normally treated as a single
    value (like a record). I treat an array or list as a collection of
    distinct objects. But this is a minor point (it affects how [] indexing
    works).

  • From Bart@21:1/5 to James Harris on Sat Oct 29 22:25:39 2022
    On 29/10/2022 15:01, James Harris wrote:
    On 27/10/2022 12:14, Bart wrote:

    Most strings are fixed-length once created; strings that can grow are
    rare. You don't need a 'capacity' field for example (like C++'s Vector
    type).

    Having watched some videos on string storage recently I now think I know
    what you mean by the capacity field - basically that a string descriptor
    would consist of these fields:

      start
      length
      capacity

    so that the string could be extended at the end (up to the capacity).
    That may be a bit restrictive. A programmer might want to remove or add
    characters at the beginning rather than just at the end, even though
    such would be done less often.

    Doing a prepend is not a problem. What's critical is whether the new
    length is still within the current allocation. (Prepend requires
    shifting of the old string so is less efficient anyway.)

    If a new allocation is needed, you may be copying data for both prepend
    and append.

    With delete however, you may need to think about whether to /reduce/
    the allocation size.

    So what do you think of having a string descriptor more like

      first
      past
      memfirst
      mempast

    where memfirst and mempast would define the allocated space in which the
    string body would sit.

    What's the difference between 'first' and 'memfirst'? Would you have a
    string that doesn't start at the beginning of its allocated block?


    An important field is objtype; its values are:

         Normal            Regular string (uses alloc64)
         Slice             Slice into another (uses objptr2)
         Extslice          Strings lie outside the object scheme

    OK. I may use something like that or, possibly, some flags.

    ..

    Note that if you take those 32 bytes, then the middle 16 bytes
    (.strptr and .length fields) correspond to a raw Slice as used in my
    lower level language.

    Good point. I'd need slices to have the same format as strings and for
    both to have flags. As there's no space for flags in the (first, past)
    pair I'd need to add a flags word, making the structure

      first
      past
      misc
      memfirst
      mempast

    where misc would store various pieces of information, not just flag
    bits. Slices would have only the first three fields. Strings would have
    all five. Flags would indicate whether this was a string or a slice.

    For me it's too early to optimise but it's worth noting that even for
    64-bit machines the above would occupy only 24 or 40 bytes of a 64-byte
    cache line so short string bodies could be stored in the same line,
    again with flags indicating that that was so.

    I think that if your string implementation requires a 24 or 40-byte
    descriptor, then thinking about cache-line optimisation /is/ premature!

    I considered such a descriptor too heavyweight for my static language.
    (I did incorporate such a string type once, intended for uses where
    performance didn't matter: sorting out UI, printing error messages and
    diagnostics, that sort of thing.)

    In the end I decided it didn't really fit. But then I have two languages.

    I guess yours likely sits between my two.

  • From James Harris@21:1/5 to Dmitry A. Kazakov on Sun Oct 30 11:24:33 2022
    On 29/10/2022 21:17, Dmitry A. Kazakov wrote:
    On 2022-10-29 13:23, James Harris wrote:

    ..

    As for this specific case, the same information can be conveyed in
    different ways: (start, length), (start, memsize), (first, last),
    (first, past). I chose the latter as it should be slightly faster than
    the others and does not run into problems when the elements are other
    than single bytes.

    Yes. The problem with (first, next) is that next could be inexpressible.
    Most difficulties arise with strings/arrays over enumerations and
    modular types. (first, last) has no such problem.

    Both have issues with empty strings, e.g. with a multitude of
    representations of. Compare with +/-0 problem for non-2-complement
    integers.

    That sounds interesting. Do you see multiple representations of the
    empty string in the following? Monospacing required. Here's how the
    string "abcd" would be stored

      !_a_!_b_!_c_!_d_!

        ^               ^
        !               !
      first            past

    * so first would point at the first element of the string
    * and past would point one cell beyond the last element of the string.

    I don't see where you see a multitude of representations of the null
    string. AISI the empty string would simply have past equal to first in
    all cases.

    ..

    That said, a slice would probably have a length which the callee can
    determine but which the callee cannot change. I presume that's what
    you'd call a constraint.

    It could be a constraint for fixed length slices.

    If a callee wanted to be able to change the length of a string then it
    would have to be passed a real string, not a slice.

    A callee might pass a variable length slice, which, for example, can
    be enlarged or shortened. Many languages with dynamically allocated
    strings have this.

    What is your definition of a slice? Is it /part/ of an underlying string
    or is it a /copy/ of part of a string? For example, if

      string S = "abcde"
      slice T = S[1..3]  ;"bcd"

    then changes to T would do what to S?

    If slice is a view of an underlying string (which is what I had in mind)
    then I don't get how you could meaningfully enlarge or shorten it.

    ..

    I guess there would be these kinds of string argument:

    1. Read-write string. Anything could be done to the string by the
    callee. (Would have to be a real string.)

    2. Read-write fixed-length string. The string's contents could be
    altered but it could not be made longer or shorter. (Could be a real
    string or a slice.)

    3. Read-only string. Neither its length nor its contents could be
    altered by the callee. (Could be a real string or a slice.)

    Think of it in terms of constraints. Immutability is a constraint.
    Fixed length is a constraint. Bounded length is a constraint.
    Non-sliding lower bound is a constraint. Non-sliding upper bound is a
    constraint.

    This should cover all spectrum. You can express all cases in terms of
    constraints.

    I presume such constraints would be specified when objects are declared.
    As a programmer how would you want to specify such constraints? Would
    each have a reserved word, for example?


    --
    James Harris

  • From Dmitry A. Kazakov@21:1/5 to James Harris on Sun Oct 30 15:20:03 2022
    On 2022-10-30 12:24, James Harris wrote:
    On 29/10/2022 21:17, Dmitry A. Kazakov wrote:
    On 2022-10-29 13:23, James Harris wrote:

    ..

    As for this specific case, the same information can be conveyed in
    different ways: (start, length), (start, memsize), (first, last),
    (first, past). I chose the latter as it should be slightly faster
    than the others and does not run into problems when the elements are
    other than single bytes.

    Yes. The problem with (first, next) is that next could be
    inexpressible. Most difficulties arise with strings/arrays over
    enumerations and modular types. (first, last) has no such problem.

    Both have issues with empty strings, e.g. with a multitude of
    representations of. Compare with +/-0 problem for non-2-complement
    integers.

    That sounds interesting. Do you see multiple representations of the
    empty string in the following? Monospacing required. Here's how the
    string "abcd" would be stored

      !_a_!_b_!_c_!_d_!

        ^               ^
        !               !
      first            past

    * so first would point at the first element of the string
    * and past would point one cell beyond the last element of the string.

    I don't see where you see a multitude of representations of the null
    string. AISI the empty string would simply have past equal to first in
    all cases.

    ...
    (0..0)
    (1..1)
    (2..2)
    ...
    (n..n)
    ...

    With pointers it becomes even worse as some of them might point to
    invalid addresses.
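
    To make the list above concrete, here is a small Python sketch in
    which indices stand in for pointers. Span and same_string are
    hypothetical names. Each (n..n) pair is a distinct descriptor value
    that denotes the same empty string, so descriptor equality and string
    equality come apart.

```python
from typing import NamedTuple

class Span(NamedTuple):
    first: int   # index of the first element
    past: int    # index one beyond the last element

    def length(self) -> int:
        return self.past - self.first

    def is_empty(self) -> bool:
        return self.first == self.past

def same_string(buf: bytes, a: Span, b: Span) -> bool:
    """Compare by content, so every empty span equals every other."""
    return buf[a.first:a.past] == buf[b.first:b.past]

buf = b"abcd"
e0, e2 = Span(0, 0), Span(2, 2)

print(e0.is_empty(), e2.is_empty())   # both denote the empty string
print(e0 == e2)                       # but the descriptors differ field by field
print(same_string(buf, e0, e2))       # content comparison treats them as equal
```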

    That said, a slice would probably have a length which the callee can
    determine but which the callee cannot change. I presume that's what
    you'd call a constraint.

    It could be a constraint for fixed length slices.

    If a callee wanted to be able to change the length of a string then
    it would have to be passed a real string, not a slice.

    A callee might pass a variable length slice, which, for example, can
    be enlarged or shortened. Many languages with dynamically allocated
    strings have this.

    What is your definition of a slice? Is it /part/ of an underlying string
    or is it a /copy/ of part of a string? For example, if

      string S = "abcde"
      slice T = S[1..3]  ;"bcd"

    then changes to T would do what to S?

    No idea. It depends. Is slice in your example an independent object?

    But considering this:

    declare
       S : String := "abcde";
    begin
       S (1..3) := "x"; -- Illegal in Ada

    But should it be legal, then the result would be

      "xde"

    Many implementations make this illegal because it would require either
    bounded or dynamically allocated unbounded string.

    You can consider making it legal for these, but then you would have
    different semantics of slices for different strings. And this would
    contradict the design principle of having all strings interchangeable
    regardless of the implementation method.

    There are contradictions in requirements that you, as the language
    designer, have to resolve one way or another.

    If slice is a view of an underlying string (which is what I had in mind)
    then I don't get how you could meaningfully enlarge or shorten it.

    It is only your limited understanding of view as immutable and fixed
    length. E.g. if you view a house in infrared why should you not be able
    to open its door? Infrared goggles would not limit you. An infrared
    photo of a house would! (:-))

    I guess there would be these kinds of string argument:

    1. Read-write string. Anything could be done to the string by the
    callee. (Would have to be a real string.)

    2. Read-write fixed-length string. The string's contents could be
    altered but it could not be made longer or shorter. (Could be a real
    string or a slice.)

    3. Read-only string. Neither its length nor its contents could be
    altered by the callee. (Could be a real string or a slice.)

    Think of it in terms of constraints. Immutability is a constraint.
    Fixed length is a constraint. Bounded length is a constraint.
    Non-sliding lower bound is a constraint. Non-sliding upper bound is a
    constraint.

    This should cover all spectrum. You can express all cases in terms of
    constraints.

    I presume such constraints would be specified when objects are declared.

    Objects and/or subtypes. Depending on the language preferences. Note
    also that you can have constrained views of the same object. E.g. you
    have a mutable variable passed down as in-argument. That would be an
    immutable view of the same object.
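
    As one concrete instance of a constrained view, Python's memoryview
    can hand a callee a read-only view of a mutable buffer (toreadonly()
    needs Python 3.8 or later). This is an analogy only; callee is a
    made-up function standing in for a subprogram with an in-argument.

```python
buffer = bytearray(b"abcde")           # the mutable object

def callee(s: memoryview) -> bool:
    """Receives an 'in' argument; returns True if a write was permitted."""
    try:
        s[0] = ord("X")
    except TypeError:                  # read-only views reject item assignment
        return False
    return True

read_only = memoryview(buffer).toreadonly()   # constrained view, same storage

print(callee(read_only))               # the write is refused through the view
buffer[0] = ord("Z")                   # the owner can still modify the object
print(bytes(read_only))                # and the view observes the change
```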

    As a programmer how would you want to specify such constraints? Would
    each have a reserved word, for example?

    In some cases constraints might be implied. But usually languages have
    lots of [sub]type modifiers like

    in, in out, out, constant
    atomic, volatile, shared
    aliased (can get pointers to)
    external, static
    public, private, protected (visibility constraints)
    range, length, bounds
    parameter AKA discriminant (general purpose constraint)
    specific type AKA static/dynamic up/downcast (view as another type)
    class-wide (view as a class of types rooted in this one)
    ...
    measurement unit

    --
    Regards,
    Dmitry A. Kazakov
    http://www.dmitry-kazakov.de

  • From Bart@21:1/5 to Dmitry A. Kazakov on Sun Oct 30 14:51:53 2022
    On 30/10/2022 14:20, Dmitry A. Kazakov wrote:
    On 2022-10-30 12:24, James Harris wrote:
    On 29/10/2022 21:17, Dmitry A. Kazakov wrote:
    On 2022-10-29 13:23, James Harris wrote:

    ..

    As for this specific case, the same information can be conveyed in
    different ways: (start, length), (start, memsize), (first, last),
    (first, past). I chose the latter as it should be slightly faster
    than the others and does not run into problems when the elements are
    other than single bytes.

    Yes. The problem with (first, next) is that next could be
    inexpressible. Most difficulties arise with strings/arrays over
    enumerations and modular types. (first, last) has no such problem.

    Both have issues with empty strings, e.g. with a multitude of
    representations of. Compare with +/-0 problem for non-2-complement
    integers.

    That sounds interesting. Do you see multiple representations of the
    empty string in the following? Monospacing required. Here's how the
    string "abcd" would be stored

       !_a_!_b_!_c_!_d_!

         ^               ^
         !               !
       first            past

    * so first would point at the first element of the string
    * and past would point one cell beyond the last element of the string.

    I don't see where you see a multitude of representations of the null
    string. AISI the empty string would simply have past equal to first in
    all cases.

       ...
       (0..0)
       (1..1)
       (2..2)
       ...
       (n..n)
       ...

    With pointers it becomes even worse as some of them might point to
    invalid addresses.

    I don't know what these numbers mean. The main problem with 'first' and
    'past' is that with an empty string, 'first' doesn't point anywhere, and
    'past' ends up pointing to that same place, wherever that is.

    I don't like it because that address is meaningless. Except possibly
    when referring to an empty slice of an actual string.

       string S = "abcde"
       slice T = S[1..3]  ;"bcd"

    then changes to T would do what to S?

    Let's try it:

    s ::= "abcde" # ::= is needed to make s (and t) mutable
    t := s[1..3]
    t[2]:="?"

    println s # a?cde
    println t # a?c

    The language doesn't allow an empty slice, say s[1..0], although it
    ought to be well-behaved (I think it just expects j>=i in s[i..j].)

    No idea. It depends. Is slice in your example an independent object?

    But considering this:

    declare
       S : String := "abcde";
    begin
       S (1..3) := "x"; -- Illegal in Ada

    But should it be legal, then the result would be

      "xde"

    Many implementations make this illegal because it would require either
    bounded or dynamically allocated unbounded string.

    The language gets to say how this works. In mine it would have to be
    like this:

    s ::= "abcde" # ::= creates a mutable copy
    s[1..3] := "xyz" # Can only insert string of matching length

    s ends up as "xyzde"
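
    Both semantics discussed here can be seen side by side in Python,
    offered purely as an illustration. A bytearray is resizable, so a
    shorter replacement gives the "xde" result; a memoryview is a
    fixed-length view and accepts only a replacement of matching length.
    Note that Python slices are 0-based and half-open, so [0:3] below
    corresponds to 1..3 in the examples above.

```python
# Resizable string: a shorter replacement shrinks the whole string.
s = bytearray(b"abcde")
s[0:3] = b"x"
print(s)                      # length changed from 5 to 3

# Fixed-length view: only a replacement of matching length is accepted.
t = bytearray(b"abcde")
with memoryview(t) as view:
    view[0:3] = b"xyz"        # same length: allowed
    try:
        view[0:3] = b"x"      # different length: rejected
        resized = True
    except ValueError:
        resized = False
print(t, resized)
```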

  • From Charles Lindsey@21:1/5 to James Harris on Sun Oct 30 17:01:21 2022
    On 29/10/2022 18:42, James Harris wrote:

    As above, the language is meant to treat strings as arrays. So AISI it
    should not ascribe any particular meaning to their contents.

    In that case you should have a look at Algol68, where strings are arrays
    of characters. Every array has an associated descriptor containing its
    bounds etc. But more importantly a REF to an array is best implemented
    by including the descriptor in the REF (being a strongly typed language,
    there is no necessity for REFs to assorted other types to have the same
    length - they are not necessarily just address values). This makes it
    easy to construct slices (immutable) and REFs to slices (so the slice is
    mutable), thus providing many of the features discussed in this thread.
    However there is no provision for extending (or shortening) an array,
    other than to create a new space to copy it into; one could provide
    library routines with smart features to avoid actual copying in some
    cases, and with a friendly interface which did not expose the messiness
    inside.

    --
    Charles H. Lindsey ---------At my New Home, still doing my own thing------
    Tel: +44 161 488 1845 Web: https://www.clerew.man.ac.uk Email: chl@clerew.man.ac.uk Snail-mail: Apt 40, SK8 5BF, U.K.
    PGP: 2C15F1A9 Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5

  • From luserdroog@21:1/5 to James Harris on Sun Oct 30 09:21:45 2022
    On Monday, October 24, 2022 at 5:31:14 AM UTC-5, James Harris wrote:
    Do you guys have any thoughts on the best ways for strings of characters
    to be stored?

    1. There's the C way, of course, of reserving one value (zero) and using
    it as a terminator.

    2. There's the 'length prefix' option of putting the length of the
    string in a machine word before the characters.

    3. There's the 'double pointer' way of pointing at, say, first and past
    (where 'past' means first plus length such that the second pointer
    points one position beyond the last character).

    Any others?

    Options 1 and 2 have the advantage that they can be referred to simply
    by address. Option 3 needs an additional place in which to store the
    (first, past) control block.

    Option 1 has the advantage that it's easy for a program to process (by
    either pointer or index).

    Options 1 and 3 have the advantage that one can refer to the tail of the
    string (anything past the first character) without creating a copy,
    although option 3 would need a new control block to be created. Option 2
    would require a new string to be created.

    In fact, option 3 has the advantage that it allows any continuous
    substring - head, mid, or tail - to be referred to without making a copy
    of the required part of the string.

    Options 2 and 3 make it fast to find the length. They also allow any
    value (i.e. including zero) to be part of the string.

    So: Which of those should a compiler support? Should it support more
    than one form? If so, should the language allow the programmer to
    specify which form to use on any particular string?

    If that's not complicated enough, the above essentially considers
    strings whose contents could be read-only or read-write but their
    lengths don't change. If the lengths can change then there are
    additional issues of storage management. Eek! ;)

    Recommendations welcome!


    I think an exhaustive list of options would be very large if you're not
    pre-judging and filtering as you're adding options.

    4) [List|Array|Tuple|Iterator] of character objects

    5) Use 7 bits for data, 8th bit for terminator. Either ASCII7 or UTF-7
    can be used to format the data to squeeze it into 7 bits.

    6) Use UCS4 codes (24bit) padded out to 32 bits, and then you get a
    whole byte for metadata attached to each character.

    ...
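
    Option 5 is simple enough to sketch. The version below is one
    possible reading of it (plain 7-bit ASCII, top bit set on the final
    character); encode_hibit and decode_hibit are invented names. One
    consequence worth noting is that the empty string has no
    representation under this scheme.

```python
def encode_hibit(text: str) -> bytes:
    """Store 7-bit ASCII; the top bit marks the final character."""
    data = bytearray(text.encode("ascii"))   # raises on non-ASCII input
    if not data:
        raise ValueError("this scheme cannot represent the empty string")
    data[-1] |= 0x80
    return bytes(data)

def decode_hibit(memory: bytes, start: int = 0) -> str:
    """Read from start until a byte with the top bit set has been consumed."""
    chars = []
    i = start
    while True:
        b = memory[i]
        chars.append(chr(b & 0x7F))
        if b & 0x80:
            return "".join(chars)
        i += 1

packed = encode_hibit("cat") + encode_hibit("dog")   # no separate terminators
print(len(packed))
print(decode_hibit(packed), decode_hibit(packed, 3))
```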

  • From James Harris@21:1/5 to Bart on Sun Oct 30 17:33:19 2022
    On 29/10/2022 22:25, Bart wrote:
    On 29/10/2022 15:01, James Harris wrote:
    On 27/10/2022 12:14, Bart wrote:

    Most strings are fixed-length once created; strings that can grow are
    rare. You don't need a 'capacity' field for example (like C++'s
    Vector type).

    Having watched some videos on string storage recently I now think I
    know what you mean by the capacity field - basically that a string
    descriptor would consist of these fields:

       start
       length
       capacity

    so that the string could be extended at the end (up to the capacity).
    That may be a bit restrictive. A programmer might want to remove or
    add characters at the beginning rather than just at the end, even
    though such would be done less often.

    Doing a prepend is not a problem. What's critical is whether the new
    length is still within the current allocation. (Prepend requires
    shifting of the old string so is less efficient anyway.)

    Well, see below.


    If a new allocation is needed, you may be copying data for both prepend
    and append.

    Yes.


    With delete however, you may need to think about whether to /reduce/ the allocation size.

    Agreed.


    So what do you think of having a string descriptor more like

       first
       past
       memfirst
       mempast

    where memfirst and mempast would define the allocated space in which
    the string body would sit.

    What's the difference between 'first' and 'memfirst'?

    memfirst would point to the start of the block in which the string body existed. (first would point at the same address or a later address.)

    Would you have a
    string that doesn't start at the beginning of its allocated block?

    Yes, that would be useful in some cases. For example, if deleting the
    first part of a string one wouldn't want to be forced to copy the rest
    of it down.

    And there are cases where strings may be built by prepending. A classic
    example is construction of a network frame. Each layer adds a header
    which, naturally, has to go on the front.
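
    A minimal sketch of the four-field descriptor proposed above, in Python, with a bytearray standing in for the allocated block. The class name StringBuffer, its method names and the headroom/tailroom defaults are assumptions made for illustration, not part of the proposal. A real implementation would probably reallocate rather than raise when it runs out of room.

```python
class StringBuffer:
    """Hypothetical descriptor: first/past delimit the string,
    memfirst/mempast delimit the allocated block that contains it."""

    def __init__(self, text: bytes, headroom: int = 8, tailroom: int = 8):
        self.mem = bytearray(headroom + len(text) + tailroom)  # allocated block
        self.memfirst = 0
        self.mempast = len(self.mem)
        self.first = headroom
        self.past = headroom + len(text)
        self.mem[self.first:self.past] = text

    def value(self) -> bytes:
        return bytes(self.mem[self.first:self.past])

    def prepend(self, header: bytes) -> None:
        """Grow towards memfirst; the existing body is not moved."""
        if self.first - len(header) < self.memfirst:
            raise MemoryError("no headroom left (a real version would reallocate)")
        self.first -= len(header)
        self.mem[self.first:self.first + len(header)] = header

    def append(self, tail: bytes) -> None:
        """Grow towards mempast."""
        if self.past + len(tail) > self.mempast:
            raise MemoryError("no tailroom left (a real version would reallocate)")
        self.mem[self.past:self.past + len(tail)] = tail
        self.past += len(tail)

    def drop_front(self, count: int) -> None:
        """Delete leading elements by advancing first; nothing is copied down."""
        if not 0 <= count <= self.past - self.first:
            raise ValueError("count out of range")
        self.first += count


# Each layer adds its header on the front, as in the network frame example.
frame = StringBuffer(b"payload")
frame.prepend(b"TCP:")
frame.prepend(b"IP:")
```

    Here the two prepends and a later drop_front only move the first index; the body of the string stays where it is inside the block.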


    --
    James Harris

  • From James Harris@21:1/5 to Bart on Sun Oct 30 17:52:41 2022
    On 29/10/2022 22:02, Bart wrote:
    On 29/10/2022 18:42, James Harris wrote:
    On 29/10/2022 16:30, Bart wrote:

    Further, functions which /return/ a string would create the string and
    return it whole.

    Not necessarily. My dynamic language can return a string which is a
    slice into another. (Slices are not exposed in this language; they are
    in the static one, where slices are distinct types.)

    Example:

        func trim(s) =
            if s.len=2 then return "" fi
            return s[2..$-1]
        end

    This trims the first and last character of string. But here it returns a slice into the original string. If I wanted a fresh copy, I'd have to
    use copy() inside the function, or copy() (or a special kind of
    assignment) outside it.

    That's a challenging example. In a sense it returns either of two
    different types: the caller could be handed a string or a slice.


    --
    James Harris

  • From James Harris@21:1/5 to luserdroog on Sun Oct 30 18:13:42 2022
    On 30/10/2022 16:21, luserdroog wrote:
    On Monday, October 24, 2022 at 5:31:14 AM UTC-5, James Harris wrote:
    Do you guys have any thoughts on the best ways for strings of characters
    to be stored?

    1. There's the C way, of course, of reserving one value (zero) and using
    it as a terminator.

    2. There's the 'length prefix' option of putting the length of the
    string in a machine word before the characters.

    3. There's the 'double pointer' way of pointing at, say, first and past
    (where 'past' means first plus length such that the second pointer
    points one position beyond the last character).

    Any others?

    ..

    I think an exhaustive list of options would be very large if you're not pre-judging and filtering as you're adding options.

    4) [List|Array|Tuple|Iterator] of character objects

    You mean where the characters are stored individually (one per node)?


    5) Use 7 bits for data, 8th bit for terminator. Either ASCII7 or UTF-7
    can be used to format the data to squeeze it into 7 bits.

    Interesting idea. It's certainly one I hadn't thought of.


    6) Use UCS4 codes (24bit) padded out to 32 bits, and then you get a
    whole byte for metadata attached to each character.

    That's definitely thinking outside the box. I can see it working if the
    user (the programmer) wanted a string of 24-bit values but it could be
    awkward in other cases such as if he wanted a string of 32-bit or 8-bit
    values. I don't think I mentioned it but I'd like the programmer to be
    able to choose what the elements of the string would be.


    --
    James Harris

  • From James Harris@21:1/5 to Dmitry A. Kazakov on Sun Oct 30 17:46:16 2022
    On 30/10/2022 14:20, Dmitry A. Kazakov wrote:
    On 2022-10-30 12:24, James Harris wrote:

    ..

    I don't see where you see a multitude of representations of the null
    string. AISI the empty string would simply have past equal to first in
    all cases.

       ...
       (0..0)
       (1..1)
       (2..2)
       ...
       (n..n)
       ...

    With pointers it becomes even worse as some of them might point to
    invalid addresses.

    In the general case strings would live at arbitrary addresses so no
    meaning could be inferred from any address. In all cases

    past - first

    would define the length of the string. If the length was zero then it
    would be an empty string.

    That said, if the string could be extended then both past and first
    would have to point to allocated memory into which the extension could
    take place.

    ..

    What is your definition of a slice? Is it /part/ of an underlying
    string or is it a /copy/ of part of a string? For example, if

       string S = "abcde"
       slice T = S[1..3]  ;"bcd"

    then changes to T would do what to S?

    No idea. It depends. Is slice in your example an independent object?

    A slice, at the moment, at least, would be a view of part of a string. Extending the earlier example,

    !_a_!_b_!_c_!_d_!
        ^       ^
        !       !
     s_first  s_past

    The string in the example would be "abcd" and the slice, delimited by
    s_first and s_past, would be the "bc" in the middle of the string. Note
    that the contents of the slice would not include the element at s_past.

    The slice would appear to be a string with constraints. By default its
    contents could be updated in place but it could not be made longer or
    shorter.

    A callee which wanted only to read a string (a common case) or to update
    a string in place should not have to care whether it was passed a string
    or a slice. For such a case, strings and slices would be interchangeable.
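
    As a rough illustration of a slice being a view with constraints, here is a sketch using Python's memoryview over a bytearray. The names base, s and count_upper are made up for the example, and the indices are Python's zero-based, half-open ones, which happen to match the first/past convention.

```python
base = bytearray(b"abcd")      # the underlying string
s = memoryview(base)[1:3]      # s_first = 1, s_past = 3, i.e. the "bc" in the middle

# The element at s_past is not part of the slice.
slice_length = len(s)

# Updating in place through the slice changes the underlying string.
s[0:2] = b"XY"

# Changing the length through the slice is refused.
try:
    s[0:2] = b"too long"
    resized = True
except ValueError:
    resized = False


def count_upper(buf) -> int:
    """A read-only callee that does not care whether it gets a string or a slice."""
    return sum(1 for byte in buf if 65 <= byte <= 90)
```

    count_upper accepts either base or s, which is the interchangeability described above for callees that only read.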


    But considering this:

    declare
       S : String := "abcde";
    begin
       S (1..3) := "x"; -- Illegal in Ada

    In Ada would the following be legal?

    S (1..3) := "xxx"; --replacement same size as what it is replacing

    I'd be happy with that.


    But should it be legal, then the result would be

      "xde"

    Many implementations make this illegal because it would require either
    a bounded or a dynamically allocated unbounded string.

    You can consider making it legal for these, but then you would have
    different semantics of slices for different strings. And this would
    contradict the design principle of having all strings interchangeable
    regardless of the implementation method.

    I don't mind there being differences along the lines of 'constraints'
    where a less-constrained object can be passed to a callee which expects
    an object with such constraints or imposes more constraints, but not one
    which needs fewer constraints.

    ..

    I presume such constraints would be specified when objects are declared.

    Objects and/or subtypes. Depending on the language preferences. Note
    also that you can have constrained views of the same object. E.g. you
    have a mutable variable passed down as in-argument. That would be an immutable view of the same object.

    Yes, and an immutable object could not be passed to a callee which
    wanted a mutable object.


    As a programmer how would you want to specify such constraints? Would
    each have a reserved word, for example?

    In some cases constraints might be implied. But usually languages have
    lots of [sub]type modifiers like

       in, in out, out, constant
       atomic, volatile, shared
       aliased (can get pointers to)
       external, static
       public, private, protected (visibility constraints)
       range, length, bounds
       parameter AKA discriminant (general purpose constraint)
       specific type AKA static/dynamic up/downcast (view as another type)
       class-wide (view as a class of types rooted in this one)
       ...
       measurement unit

    So you wouldn't have a keyword to indicate a constraint such as
    "Non-sliding lower bound" which you mentioned before but IIUC you might
    have some qualification of the 'bounds' keyword as in

    bounds(^..)

    to indicate an unchangeable lower bound (with ^ meaning the start of the string)?


    --
    James Harris

  • From Dmitry A. Kazakov@21:1/5 to James Harris on Sun Oct 30 19:28:50 2022
    On 2022-10-30 18:46, James Harris wrote:
    On 30/10/2022 14:20, Dmitry A. Kazakov wrote:
    On 2022-10-30 12:24, James Harris wrote:

    ..

    I don't see where you see a multitude of representations of the null
    string. AISI the empty string would simply have past equal to first
    in all cases.

        ...
        (0..0)
        (1..1)
        (2..2)
        ...
        (n..n)
        ...

    With pointers it becomes even worse as some of them might point to
    invalid addresses.

    In the general case strings would live at arbitrary addresses so no
    meaning could be inferred from any address.

    As I said, that is a problem which will preclude some implementations
    and make others inefficient.

    And in general using pointers is wasting space as strings are much shorter.

    In all cases

      past - first

    would define the length of the string.

    Not really. Usually negative length is considered equivalent to zero,
    e.g. when iterating substrings. Other choices may consider iteration in reverse, when bounds are inverted.

    If the length was zero then it would be an empty string.

    Ergo, a special case to treat in many operations.

    But considering this:

    declare
        S : String := "abcde";
    begin
        S (1..3) := "x"; -- Illegal in Ada

    In Ada would the following be legal?

    Yes, in Ada slice length is constrained as the string length is.

      S (1..3) := "xxx";  --replacement same size as what it is replacing

    I'd be happy with that.

    It is still not fully defined. You need to consider the issue of sliding bounds. E.g.

    S (2..4) (2) := 'x'; -- Assign a character

    Now with sliding:

    S (2..4) (2) := 'x' gives "abxde", x is second in the slice

    without sliding

    S (2..4) (2) := 'x' gives "axcde", x is at 2 in the original string

    In Ada the right side slides, the left does not. Sliding the right side
    allows doing logical things like:

    S1 (1..5) := S1 (5..9); -- 5..9 slides to 1..5
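
    The two readings of S (2..4) (2) := 'x' can be sketched as follows. This is only an illustration of the index arithmetic, not of how Ada implements it; the function name and the sliding flag are hypothetical, and bounds are 1-based and inclusive as in the examples.

```python
def assign_in_slice(s: str, lo: int, hi: int, index: int, ch: str,
                    sliding: bool) -> str:
    """Return s with one character replaced inside the slice lo..hi.

    sliding=True:  the slice is re-indexed from 1, so index counts from lo.
    sliding=False: the slice keeps the indices of the original string.
    """
    pos = lo + index - 1 if sliding else index
    if not lo <= pos <= hi:
        raise IndexError("index outside the slice")
    chars = list(s)
    chars[pos - 1] = ch          # convert the 1-based position to Python's 0-based
    return "".join(chars)


with_sliding = assign_in_slice("abcde", 2, 4, 2, "x", sliding=True)
without_sliding = assign_in_slice("abcde", 2, 4, 2, "x", sliding=False)
```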

    But should it be legal, then the result would be

       "xde"

    Many implementations make this illegal because it would require either
    a bounded or a dynamically allocated unbounded string.

    You can consider making it legal for these, but then you would have
    different semantics of slices for different strings. And this would
    contradict the design principle of having all strings interchangeable
    regardless of the implementation method.

    I don't mind there being differences along the lines of 'constraints'
    where a less-constrained object can be passed to a callee which expects
    an object with such constraints or imposes more constraints, but not one which needs fewer constraints.

    No, the problem is with semantics. E.g. suppose that in a subprogram you do

    S (1..100) := ""; -- Remove the first 100 characters

    S is a formal parameter. Then, depending on the actual parameter's
    subtype it may succeed (for a bounded length string) or fail (for a
    fixed length string). Such things are big no-no in language design,
    because they become a nightmare for programmers to track.

    I presume such constraints would be specified when objects are declared.

    Objects and/or subtypes. Depending on the language preferences. Note
    also that you can have constrained views of the same object. E.g. you
    have a mutable variable passed down as in-argument. That would be an
    immutable view of the same object.

    Yes, and an immutable object could not be passed to a callee which
    wanted a mutable object.

    Yes, lifting a constraint is not possible in most cases. However,
    dynamic cast is a counterexample. You can move the view up the
    inheritance tree. But this is frowned upon since it enforces certain implementations.

    As a programmer how would you want to specify such constraints? Would
    each have a reserved word, for example?

    In some cases constraints might be implied. But usually languages have
    lots of [sub]type modifiers like

        in, in out, out, constant
        atomic, volatile, shared
        aliased (can get pointers to)
        external, static
        public, private, protected (visibility constraints)
        range, length, bounds
        parameter AKA discriminant (general purpose constraint)
        specific type AKA static/dynamic up/downcast (view as another type)
        class-wide (view as a class of types rooted in this one)
        ...
        measurement unit

    So you wouldn't have a keyword to indicate a constraint such as
    "Non-sliding lower bound" which you mentioned before but IIUC you might
    have some qualification of the 'bounds' keyword as in

      bounds(^..)

    to indicate an unchangeable lower bound (with ^ meaning the start of the string)?

    I am not sure if sliding constraint might be usable. It is a different
    issue to constraining bounds because it involves operations like
    assignment. And it is not clear how to implement such a constraint
    effectively. Most constraints are either static (compile time), or
    simple to represent, like bounds or type tags. Sliding might be
    implemented as a flag, but then you will have to check it all the time.
    Maybe not worth having it as a choice. And it is unclear what is the unconstrained state, sliding or non-sliding? (:-))

    --
    Regards,
    Dmitry A. Kazakov
    http://www.dmitry-kazakov.de

  • From Bart@21:1/5 to James Harris on Sun Oct 30 22:37:36 2022
    On 30/10/2022 17:52, James Harris wrote:
    On 29/10/2022 22:02, Bart wrote:
    On 29/10/2022 18:42, James Harris wrote:
    On 29/10/2022 16:30, Bart wrote:

    Further, functions which /return/ a string would create the string
    and return it whole.

    Not necessarily. My dynamic language can return a string which is a
    slice into another. (Slices are not exposed in this language; they are
    in the static one, where slices are distinct types.)

    Example:

         func trim(s) =
             if s.len=2 then return "" fi
             return s[2..$-1]
         end

    This trims the first and last character of string. But here it returns
    a slice into the original string. If I wanted a fresh copy, I'd have
    to use copy() inside the function, or copy() (or a special kind of
    assignment) outside it.

    That's a challenging example. In a sense it returns either of two
    different types: the caller could be handed a string or a slice.


    In this language, it only has a String type, not a Slice. Slicing is an operation you apply on strings to yield another String object.

    (Internally, it has to distinguish between owned strings and slices into strings owned by other objects, but as I said that aspect is not exposed.)
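
    One way this arrangement could look, sketched in Python rather than in the language discussed: a single user-visible type whose internal owns flag records whether it owns its buffer or is a slice into someone else's. The class Str, its fields and the 1-based inclusive slice method are assumptions made for the example and are not taken from the actual implementation. The trim here treats strings of length 2 or less as trimming to empty, which is also an assumption.

```python
class Str:
    """One visible string type; `owns` is internal bookkeeping only."""

    def __init__(self, buf: bytearray, start: int, length: int, owns: bool):
        self.buf, self.start, self.length, self.owns = buf, start, length, owns

    @classmethod
    def new(cls, text: bytes) -> "Str":
        """Create a string that owns a fresh buffer."""
        return cls(bytearray(text), 0, len(text), owns=True)

    def slice(self, lo: int, hi: int) -> "Str":
        """1-based inclusive slice that shares this string's buffer."""
        return Str(self.buf, self.start + lo - 1, hi - lo + 1, owns=False)

    def copy(self) -> "Str":
        """Fresh owned string with the same contents."""
        return Str.new(self.value())

    def value(self) -> bytes:
        return bytes(self.buf[self.start:self.start + self.length])


def trim(s: Str) -> Str:
    """Drop the first and last character, returning a slice into s."""
    if s.length <= 2:
        return Str.new(b"")
    return s.slice(2, s.length - 1)


original = Str.new(b"[abc]")
trimmed = trim(original)       # shares original's buffer
independent = trimmed.copy()   # owns its own buffer
```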

  • From luserdroog@21:1/5 to James Harris on Sun Oct 30 18:03:27 2022
    On Sunday, October 30, 2022 at 1:13:44 PM UTC-5, James Harris wrote:
    On 30/10/2022 16:21, luserdroog wrote:
    On Monday, October 24, 2022 at 5:31:14 AM UTC-5, James Harris wrote:
    Do you guys have any thoughts on the best ways for strings of characters
    to be stored?

    1. There's the C way, of course, of reserving one value (zero) and using
    it as a terminator.

    2. There's the 'length prefix' option of putting the length of the
    string in a machine word before the characters.

    3. There's the 'double pointer' way of pointing at, say, first and past
    (where 'past' means first plus length such that the second pointer
    points one position beyond the last character).

    Any others?
    ..
    I think an exhaustive list of options would be very large if you're not pre-judging and filtering as you're adding options.

    4) [List|Array|Tuple|Iterator] of character objects
    You mean where the characters are stored individually (one per node)?

    Yep. Fat characters, or whatever other scaffolding helps for the operations
    you want to support.

    5) Use 7 bits for data, 8th bit for terminator. Either ASCII7 or UTF-7
    can be used to format the data to squeeze it into 7 bits.
    Interesting idea. It's certainly one I hadn't thought of.

    It's used in the Classical Forth Dictionary header for the name field which
    can be variable length. Often it's followed by a length byte and you store
    the pointer to the length byte, using

    (length - *length)

    to actually get the pointer to the start.
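
    A small sketch of option 5 in Python, assuming plain 7-bit ASCII data with the high bit set on the final byte as the terminator. The function names are made up, and this shows only the terminator idea, not the full layout of any particular Forth's dictionary header.

```python
def encode_high_bit_terminated(text: str) -> bytes:
    """Store 7-bit characters and mark the last one by setting bit 7."""
    data = bytearray(text.encode("ascii"))   # rejects anything outside 7 bits
    if not data:
        raise ValueError("this scheme cannot represent an empty string")
    data[-1] |= 0x80
    return bytes(data)


def decode_high_bit_terminated(mem: bytes, start: int = 0) -> str:
    """Read characters from mem[start] up to and including the marked byte."""
    chars = []
    for byte in mem[start:]:
        chars.append(chr(byte & 0x7F))
        if byte & 0x80:
            return "".join(chars)
    raise ValueError("no terminator found")


encoded = encode_high_bit_terminated("DUP")
decoded = decode_high_bit_terminated(encoded + b"following data")
```

    Note that the scheme has no way to express an empty string, since the terminator bit has to live on some character.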

    6) Use UCS4 codes (24bit) padded out to 32 bits, and then you get a
    whole byte for metadata attached to each character.
    That's definitely thinking outside the box. I can see it working if the
    user (the programmer) wanted a string of 24-bit values but it could be awkward in other cases such as if he wanted a string of 32-bit or 8-bit values. I don't think I mentioned it but I'd like the programmer to be
    able to choose what the elements of the string would be.

    This is what I was using in my unfinished APL-like language. The 8 bits
    of tag meant it was easy for a node to be a character or an integer (25 bit)
    or a nested subarray or whatever ... mpfp number... symbol table.

    So I didn't really need a string type *per se* because there's an array type whose data elements are these 32bits of encoded whatevs. A string is
    just a 1D array, or maybe an array of arrays or a 2D array padded out with spaces on each line. You'd read it in or receive it initially as a 1D array probably
    from a file. Oh, an element could also be a file.
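
    The tagged-cell idea can be sketched like this in Python. The tag values, the function names and the plain 8-bit tag / 24-bit payload split are assumptions made for the example and do not reproduce the actual encoding used in that language.

```python
TAG_CHAR = 1   # hypothetical tag values
TAG_INT = 2


def pack(tag: int, payload: int) -> int:
    """Combine an 8-bit tag and a 24-bit payload into one 32-bit cell."""
    if not 0 <= tag <= 0xFF:
        raise ValueError("tag must fit in 8 bits")
    if not 0 <= payload <= 0xFFFFFF:
        raise ValueError("payload must fit in 24 bits")
    return (tag << 24) | payload


def unpack(cell: int) -> tuple:
    """Split a 32-bit cell back into (tag, payload)."""
    return cell >> 24, cell & 0xFFFFFF


# A "string" is then just a 1D array of cells that all carry the character tag.
cells = [pack(TAG_CHAR, ord(c)) for c in "hello"]
text = "".join(chr(payload) for tag, payload in map(unpack, cells)
               if tag == TAG_CHAR)
```

    Every Unicode code point fits in the payload, since the largest one (0x10FFFF) needs only 21 bits.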

  • From David Brown@21:1/5 to James Harris on Mon Oct 31 17:58:09 2022
    On 30/10/2022 19:13, James Harris wrote:
    On 30/10/2022 16:21, luserdroog wrote:
    On Monday, October 24, 2022 at 5:31:14 AM UTC-5, James Harris wrote:
    Do you guys have any thoughts on the best ways for strings of characters
    to be stored?

    1. There's the C way, of course, of reserving one value (zero) and using
    it as a terminator.

    2. There's the 'length prefix' option of putting the length of the
    string in a machine word before the characters.

    3. There's the 'double pointer' way of pointing at, say, first and past
    (where 'past' means first plus length such that the second pointer
    points one position beyond the last character).

    Any others?

    ..

    I think an exhaustive list of options would be very large if you're not
    pre-judging and filtering as you're adding options.

    4) [List|Array|Tuple|Iterator] of character objects

    You mean where the characters are stored individually (one per node)?


    5) Use 7 bits for data, 8th bit for terminator. Either ASCII7 or UTF-7
    can be used to format the data to squeeze it into 7 bits.

    Interesting idea. It's certainly one I hadn't thought of.

    Nor should you - that is a crazy idea. It is massively inefficient, as
    well as being inconsistent with everything else.



    6) Use UCS4 codes (24bit) padded out to 32 bits, and then you get a
    whole byte for metadata attached to each character.

    That's definitely thinking outside the box. I can see it working if the
    user (the programmer) wanted a string of 24-bit values but it could be awkward in other cases such as if he wanted a string of 32-bit or 8-bit values. I don't think I mentioned it but I'd like the programmer to be
    able to choose what the elements of the string would be.


    UCS4 is 31 bit, not 24 bit. Perhaps luserdroog is mixing it up with
    UTF-32, which can be covered by 21 bits. (The extra 10 bits in UCS4
    have never been anything but 0, but if you want to refer to a long
    out-dated and obsolete standard, it should still be done so accurately.)

    Do not make a new string or character storage system based around
    anything obsolete - that includes UTF-7 and UCS4. Making lots of extra
    work for yourself to support something that everyone else rejected as complicated, unnecessary and unused decades ago, is just silly.

    There are only two character encodings that make any kind of sense in a
    modern language (i.e., anything from this century). UTF-8 and
    /possibly/ UTF-32 for internal usage, if you find it more convenient for
    the operations you want.

    If you are using UTF-32 only for internal usage within the language, and
    not for export (external function calls, file IO, etc.), then the high
    byte is always unused - and therefore free for metadata if that's what
    you want. I'm not convinced it is a good idea, but it's certainly possible.

    For any kind of interaction with anything else, UTF-8 is the standard.
    There are a few other encodings that haven't died off completely yet,
    but they are all on the way out.

    I would also recommend treating characters and character strings as
    something very different from raw bytes and binary blobs. Users want to
    do very different things with them, and many of the useful operations
    are completely different. Some languages have made the mistake of
    conflating the two concepts - it's difficult to fix once that design
    flaw is set into a language.
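
    Python happens to keep the two concepts apart (str for text, bytes for raw data), so it can serve as a quick illustration of how the same operations give different answers on each. The sample text is arbitrary.

```python
text = "na\u00efve caf\u00e9"      # "naïve café" as character data
raw = text.encode("utf-8")          # the same content as raw bytes

char_count = len(text)              # counts characters
byte_count = len(raw)               # counts bytes; the two accented letters take 2 each

upper_text = text.upper()           # case mapping understands the accented letters
upper_raw = raw.upper()             # on bytes, only the ASCII letters are changed

# Cutting raw bytes at an arbitrary offset can split a character in half.
try:
    raw[:3].decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False
```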

  • From Bart@21:1/5 to David Brown on Mon Oct 31 17:31:28 2022
    On 31/10/2022 16:58, David Brown wrote:
    On 30/10/2022 19:13, James Harris wrote:
    On 30/10/2022 16:21, luserdroog wrote:
    On Monday, October 24, 2022 at 5:31:14 AM UTC-5, James Harris wrote:
    Do you guys have any thoughts on the best ways for strings of
    characters
    to be stored?

    1. There's the C way, of course, of reserving one value (zero) and
    using
    it as a terminator.

    2. There's the 'length prefix' option of putting the length of the
    string in a machine word before the characters.

    3. There's the 'double pointer' way of pointing at, say, first and past
    (where 'past' means first plus length such that the second pointer
    points one position beyond the last character).

    Any others?

    ..

    I think an exhaustive list of options would be very large if you're not
    pre-judging and filtering as you're adding options.

    4) [List|Array|Tuple|Iterator] of character objects

    You mean where the characters are stored individually (one per node)?


    5) Use 7 bits for data, 8th bit for terminator. Either ASCII7 or UTF-7
    can be used to format the data to squeeze it into 7 bits.

    Interesting idea. It's certainly one I hadn't thought of.

    Nor should you - that is a crazy idea.  It is massively inefficient, as
    well as being inconsistent with everything else.

    It's a perfectly fine idea - for the 1970s.

    (Now the 8th bit is better put to use to represent UTF8.)

  • From David Brown@21:1/5 to Bart on Mon Oct 31 20:34:17 2022
    On 31/10/2022 18:31, Bart wrote:
    On 31/10/2022 16:58, David Brown wrote:
    On 30/10/2022 19:13, James Harris wrote:
    On 30/10/2022 16:21, luserdroog wrote:
    On Monday, October 24, 2022 at 5:31:14 AM UTC-5, James Harris wrote:
    Do you guys have any thoughts on the best ways for strings of
    characters
    to be stored?

    1. There's the C way, of course, of reserving one value (zero) and
    using
    it as a terminator.

    2. There's the 'length prefix' option of putting the length of the
    string in a machine word before the characters.

    3. There's the 'double pointer' way of pointing at, say, first and
    past
    (where 'past' means first plus length such that the second pointer
    points one position beyond the last character).

    Any others?

    ..

    I think an exhaustive list of options would be very large if you're not
    pre-judging and filtering as you're adding options.

    4) [List|Array|Tuple|Iterator] of character objects

    You mean where the characters are stored individually (one per node)?


    5) Use 7 bits for data, 8th bit for terminator. Either ASCII7 or UTF-7
    can be used to format the data to squeeze it into 7 bits.

    Interesting idea. It's certainly one I hadn't thought of.

    Nor should you - that is a crazy idea.  It is massively inefficient,
    as well as being inconsistent with everything else.

    It's a perfectly fine idea - for the 1970s.

    (Now the 8th bit is better put to use to represent UTF8.)

    Indeed.

    UTF-7 was invented in an attempt to make an encoding for Unicode that would
    work for email, since some email servers did not handle 8-bit characters
    correctly, long ago in the dark ages. It was never formally a Unicode
    encoding, and was almost never used in practice.

    Using 7-bit characters and the eighth bit for a terminator would be
    pretty inefficient - 12.5% wasted space to get a single bit of useful information per string. Not a good trade-off!

  • From James Harris@21:1/5 to Bart on Thu Nov 3 16:10:38 2022
    On 30/10/2022 22:37, Bart wrote:
    On 30/10/2022 17:52, James Harris wrote:
    On 29/10/2022 22:02, Bart wrote:
    On 29/10/2022 18:42, James Harris wrote:
    On 29/10/2022 16:30, Bart wrote:

    Further, functions which /return/ a string would create the string
    and return it whole.

    Not necessarily. My dynamic language can return a string which is a
    slice into another. (Slices are not exposed in this language; they
    are in the static one, where slices are distinct types.)

    Example:

         func trim(s) =
             if s.len=2 then return "" fi
             return s[2..$-1]
         end

    This trims the first and last character of string. But here it
    returns a slice into the original string. If I wanted a fresh copy,
    I'd have to use copy() inside the function, or copy() (or a special
    kind of assignment) outside it.

    That's a challenging example. In a sense it returns either of two
    different types: the caller could be handed a string or a slice.


    In this language, it only has a String type, not a Slice. Slicing is an operation you apply on strings to yield another String object.

    (Internally, it has to distinguish between owned strings and slices into strings owned by other objects, but as I said that aspect is not exposed.)

    As most uses of slices would be read-only but some would be read-write,
    and there are various potential ways to implement a slice, it might be
    sensible for me to do something like:

    1. Have the default slice as read-only and the simplest to construct. If
    s is a string or another slice then a slice of part of that string could
    be constructed as

    s(1..4)

    2. Use keywords to effect other, more specialised, operations. For example,

    s.slice_cow(1..4) ;a copy-on-write slice
    s.slice_rw(1..4) ;a slice via which the base string can be changed
    s.copy(1..4) ;copy the designated chars to a new string
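
    To make the four variants concrete, here is a rough Python sketch. The class CowSlice is hypothetical, the read-only and read-write slices are approximated with memoryview, and the ranges use Python's zero-based, half-open convention rather than whatever s(1..4) would finally mean.

```python
class CowSlice:
    """Hypothetical copy-on-write slice: shares the base until first written."""

    def __init__(self, base: bytearray, lo: int, hi: int):
        self._buf = memoryview(base)[lo:hi].toreadonly()
        self._private = False

    def __setitem__(self, index: int, value: int) -> None:
        if not self._private:
            self._buf = bytearray(self._buf)   # take a private copy on first write
            self._private = True
        self._buf[index] = value

    def value(self) -> bytes:
        return bytes(self._buf)


base = bytearray(b"abcdef")
ro = memoryview(base).toreadonly()[1:5]   # default: read-only view
rw = memoryview(base)[1:5]                # slice_rw: writes reach the base string
cow = CowSlice(base, 1, 5)                # slice_cow: shared until written
cp = bytearray(base[1:5])                 # copy: independent from the start

rw[0] = ord("X")    # changes base, and is visible through ro
cow[1] = ord("Y")   # detaches cow from base; base is not changed by this
```

    After the two writes the base string holds "aXcdef", the read-only view shows "Xcde", the copy-on-write slice holds its own "XYde" and the copy still holds "bcde".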


    --
    James Harris

  • From James Harris@21:1/5 to David Brown on Thu Nov 3 15:48:20 2022
    On 31/10/2022 16:58, David Brown wrote:
    On 30/10/2022 19:13, James Harris wrote:
    On 30/10/2022 16:21, luserdroog wrote:
    On Monday, October 24, 2022 at 5:31:14 AM UTC-5, James Harris wrote:

    Do you guys have any thoughts on the best ways for strings of
    characters
    to be stored?

    ..

    5) Use 7 bits for data, 8th bit for terminator. Either ASCII7 or UTF-7
    can be used to format the data to squeeze it into 7 bits.

    Interesting idea. It's certainly one I hadn't thought of.

    Nor should you - that is a crazy idea.  It is massively inefficient, as
    well as being inconsistent with everything else.

    The model I have chosen (at least, for now) is to have a string indexed logically from zero (so indices do not need to be stored) and, for implementation, delimited by two pointers.

    The one downside I am aware of is that it will, at times, require
    creation and destruction of a small descriptor. I'll have to see how the approach works out in practice.

    ..

    I would also recommend treating characters and character strings as
    something very different from raw bytes and binary blobs.  Users want to
    do very different things with them, and many of the useful operations
    are completely different.  Some languages have made the mistake of conflating the two concepts - it's difficult to fix once that design
    flaw is set into a language.

    That sounds interesting but I cannot tell what you have in mind.

    One could consider strings as having two categories of operation: those
    which involve only the memory used by strings such as allocation, concatenation, insertion, deletion, etc; and those which care about the contents of a string such as capitalisation, comparison, whitespace recognition, parsing, etc. Why could the mechanics not apply to raw
    bytes and blobs?


    --
    James Harris

  • From James Harris@21:1/5 to Dmitry A. Kazakov on Thu Nov 3 15:57:09 2022
    On 30/10/2022 18:28, Dmitry A. Kazakov wrote:
    On 2022-10-30 18:46, James Harris wrote:

    ..

    In Ada would the following be legal?

    Yes, in Ada slice length is constrained as the string length is.

       S (1..3) := "xxx";  --replacement same size as what it is replacing

    I'd be happy with that.

    It is still not fully defined. You need to consider the issue of sliding bounds. E.g.

      S (2..4) (2) := 'x'; -- Assign a character

    Now with sliding:

      S (2..4) (2) := 'x'  gives "abxde", x is second in the slice

    without sliding

      S (2..4) (2) := 'x'  gives "axcde", x is at 2 in the original string

    AISI the elements of strings and slices would always be accessed by
    offset. That appears to be the 'sliding' model you mention.


    In Ada the right side slides, the left does not. Sliding the right side allows doing logical things like:

      S1 (1..5) := S1 (5..9); -- 5..9 slides to 1..5

    I don't see any sliding there, only that chars 5 to 9 are copied over
    chars 1 to 5 of the same string.

    ..


    I am not sure if sliding constraint might be usable. It is a different
    issue to constraining bounds because it involves operations like
    assignment. And it is not clear how to implement such a constraint effectively. Most constraints are either static (compile time), or
    simple to represent, like bounds or type tags. Sliding might be
    implemented as a flag, but then you will have to check it all the time.
    Maybe not worth having it as a choice. And it is unclear what is the unconstrained state, sliding or non-sliding? (:-))


    Always sliding (if I understand the term)!


    --
    James Harris

  • From David Brown@21:1/5 to James Harris on Fri Nov 4 14:58:42 2022
    On 03/11/2022 16:48, James Harris wrote:
    On 31/10/2022 16:58, David Brown wrote:
    On 30/10/2022 19:13, James Harris wrote:
    On 30/10/2022 16:21, luserdroog wrote:
    On Monday, October 24, 2022 at 5:31:14 AM UTC-5, James Harris wrote:

    Do you guys have any thoughts on the best ways for strings of
    characters
    to be stored?

    ..

    5) Use 7 bits for data, 8th bit for terminator. Either ASCII7 or UTF-7
    can be used to format the data to squeeze it into 7 bits.

    Interesting idea. It's certainly one I hadn't thought of.

    Nor should you - that is a crazy idea.  It is massively inefficient,
    as well as being inconsistent with everything else.

    The model I have chosen (at least, for now) is to have a string indexed logically from zero (so indices do not need to be stored) and, for implementation, delimited by two pointers.

    The one downside I am aware of is that it will, at times, require
    creation and destruction of a small descriptor. I'll have to see how the approach works out in practice.
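
    A rough sketch of that model in Python, purely as an illustration: the
    class name StrView and its methods are hypothetical, and integer offsets
    into a shared buffer stand in for the two pointers.

```python
class StrView:
    """A (first, past) descriptor over a shared buffer; indices start at 0."""

    def __init__(self, buf, first=0, past=None):
        self.buf = buf
        self.first = first
        self.past = len(buf) if past is None else past

    def __len__(self):
        return self.past - self.first        # O(1), no terminator scan

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        return self.buf[self.first + i]

    def sub(self, start, stop):
        """New descriptor for [start, stop); no characters are copied."""
        if not 0 <= start <= stop <= len(self):
            raise IndexError((start, stop))
        return StrView(self.buf, self.first + start, self.first + stop)

    def text(self):
        return bytes(self.buf[self.first:self.past]).decode("ascii")


buf = bytearray(b"hello world")
tail = StrView(buf).sub(6, 11)      # only a small descriptor is created
print(len(tail), tail.text())       # 5 world
buf[6] = ord("W")                   # the view shares storage with buf
print(tail.text())                  # World
```

    The small descriptor created by sub() is the cost mentioned above; in
    exchange any contiguous substring can be named without copying.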

    There is no universally ideal way to store a string - every method is inefficient for some uses and operations. You just have to find
    something that is good enough for typical uses in your language, add
    support for connecting to external code (such as exporting C-style
    strings), and make it possible for users to implement other types when
    it makes more sense for the user code.


    ..

    I would also recommend treating characters and character strings as
    something very different from raw bytes and binary blobs.  Users want
    to do very different things with them, and many of the useful
    operations are completely different.  Some languages have made the
    mistake of conflating the two concepts - it's difficult to fix once
    that design flaw is set into a language.

    That sounds interesting but I cannot tell what you have in mind.


    I mean you should consider a "string" to be a way of holding a sequence
    of "character" units which can hold a code unit of UTF-8 (since any
    other choice of character encoding is madness).

    This should be different from a "byte", which would be an 8-bit unit of
    raw memory (ignore the existence of machines that can't address 8-bit
    memory units directly - they will never use your language). Arrays of
    this type should always be contiguous, and used for raw memory access.
    (You may also want larger types - memory16, memory32, memory64, etc., if
    that is convenient for efficient usage.)

    So when you read a block of data from a file, or send a block to a
    network socket, it is an array of bytes - not a string. (You can have high-level abstractions for a "text file" wrapper that can read and
    write strings, but that's not fundamental.) If you have an equivalent
    of C's "memcpy" function it should use bytes, not any kind of character
    type. If you have something like C's "type-based aliasing rules", then
    it is bytes that should have the special exception, not characters.
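
    Python 3 is one existing language that draws this line, so it can serve
    as an illustration (a sketch only; the sample text is arbitrary):

```python
text = "na\u00efve \u00a35"       # a string: 8 characters, two of them non-ASCII
raw = text.encode("utf-8")        # an array of bytes, e.g. for a file or socket

print(len(text))                  # 8
print(len(raw))                   # 10: each non-ASCII character takes two bytes

# Mixing the two is a type error rather than a silent reinterpretation.
try:
    text + raw
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)                   # False

# Going back requires an explicit conversion that names the encoding.
print(raw.decode("utf-8") == text)   # True
```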

    Neither "byte" nor "character" should have any kind of arithmetic
    operators - they are not integers. But you will need cast or conversion operations on them.

    The concept of "signed char" and "unsigned char" in C is a serious
    design flaw. A type designed to hold letters should not have a sign,
    and should not be used to hold arbitrary raw, low-level data.


    You might also consider not having a character type at all. Python 3
    has no character types - "a" is a string, not a character.
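
    A quick check of that in Python 3 itself:

```python
s = "abc"
c = s[0]                   # indexing a string yields another string

print(type(c).__name__)    # str
print(len(c))              # 1
print(c[0] == c)           # True: a 1-character string indexes to itself
```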


    One could consider strings as having two categories of operation: those
    which involve only the memory used by strings such as allocation, concatenation, insertion, deletion, etc; and those which care about the contents of a string such as capitalisation, comparison, whitespace recognition, parsing, etc. Why could the mechanics not apply to raw
    bytes and blobs?


    Operations on strings are those that are relevant to strings. It makes
    no sense to capitalise a raw binary blob. It makes no sense to have
    methods for linking or chaining sets of blobs - these are direct handles
    into memory, and chaining, allocation, etc., are higher level operations.

    Raw binary buffers require nothing more than an address and a size to
    describe them - anything more, and it is too high level. (Again,
    there's nothing wrong with providing higher level features and
    interfaces, but they have to build on the fundamental ones.)

  • From Bart@21:1/5 to David Brown on Fri Nov 4 17:50:59 2022
    On 04/11/2022 13:58, David Brown wrote:

    Neither "byte" nor "character" should have any kind of arithmetic
    operators - they are not integers.  But you will need cast or conversion operations on them.

    Bytes are small integers, typically of u8 type.

    I can't see why arithmetic can't be done with them, unless you want a
    purer kind of language where arithmetic is only allowed on signed
    numbers, and bitwise ops only on unsigned numbers, which is usually
    going to be a pain for all concerned.

    The concept of "signed char" and "unsigned char" in C is a serious
    design flaw.  A type designed to hold letters should not have a sign,
    and should not be used to hold arbitrary raw, low-level data.

    Signed and unsigned chars are not so bad; presumably C intended these to
    do the job of a 'byte' type for small integers. So it was just a poor
    choice of name. (After all there is no separate type in C for bytes
    holding character data.)

    What's bad is that third kind: a 'plain char' type, which is
    incompatible with both signed and unsigned char, even though it
    necessarily needs to be one or the other on a specific platform. It
    occurs in no other language, and causes problems within FFI APIs.

  • From James Harris@21:1/5 to Bart on Fri Nov 4 21:35:46 2022
    On 04/11/2022 17:50, Bart wrote:
    On 04/11/2022 13:58, David Brown wrote:

    Neither "byte" nor "character" should have any kind of arithmetic
    operators - they are not integers.  But you will need cast or
    conversion operations on them.

    Bytes are small integers, typically of u8 type.

    I can't see why arithmetic can't be done with them, unless you want a
    purer kind of language where arithmetic is only allowed on signed
    numbers, and bitwise ops only on unsigned numbers, which is usually
    going to be a pain for all concerned.

    I think what David means is that arithmetic operations don't apply to characters (even though some languages permit such operations). For
    example, neither

    'a' * 5

    nor even

    'R' + 1

    have any meaning over the set of characters. Prohibiting arithmetic on
    them could be done but would make classifying and manipulating
    characters difficult unless one had a comprehensive set of library
    functions such as

    is_digit(char)
    is_alphanum(locale, char)
    is_lower(locale, char)
    upper(locale, char)

    and many more.
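
    As a sketch of what such routines might look like, here they are in
    Python on top of the standard unicodedata module. The function names
    follow the list above; the locale parameter is left out, which is a
    simplification and not a claim that it is unnecessary.

```python
import unicodedata


def is_digit(ch):
    return unicodedata.category(ch) == "Nd"       # decimal digit, any script


def is_lower(ch):
    return unicodedata.category(ch) == "Ll"       # lowercase letter


def is_alphanum(ch):
    return unicodedata.category(ch)[0] in ("L", "N")   # any letter or number


def upper(ch):
    return ch.upper()    # may return more than one character, e.g. for U+00DF


print(is_digit("7"), is_digit("x"))     # True False
print(is_lower("r"), is_lower("R"))     # True False
print(is_alphanum("_"))                 # False
print(upper("r"))                       # R
```

    No arithmetic on the character appears anywhere in the calling code; it
    is all hidden behind the library's tables.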


    --
    James Harris

  • From Bart@21:1/5 to James Harris on Fri Nov 4 22:28:00 2022
    On 04/11/2022 21:35, James Harris wrote:
    On 04/11/2022 17:50, Bart wrote:
    On 04/11/2022 13:58, David Brown wrote:

    Neither "byte" nor "character" should have any kind of arithmetic
    operators - they are not integers.  But you will need cast or
    conversion operations on them.

    Bytes are small integers, typically of u8 type.

    I can't see why arithmetic can't be done with them, unless you want a
    purer kind of language where arithmetic is only allowed on signed
    numbers, and bitwise ops only on unsigned numbers, which is usually
    going to be a pain for all concerned.

    I think what David means is that arithmetic operations don't apply to characters

    I was picking on the 'byte' type; it seems extraordinary that you
    shouldn't be allowed to do arithmetic with them. If you can initialise a
    byte value with a number like this:

    byte a = 123

    then it's a number!

    (even though some languages permit such operations). For
    example, neither

      'a' * 5

    nor even

      'R' + 1

    have any meaning over the set of characters.

    I actually had such a restriction for a while: char*5 wasn't allowed,
    but char+1 was. After all why on earth shouldn't you want the next
    character in that alphabet? Why should code like this be made illegal:

    a := a * 10 + (c - '0')

    Then I realised I shouldn't be telling the programmer what they can and
    can't do with characters, as there might be some perfectly valid
    use-case that I simply hadn't thought of.

    Maybe 'a' * 5 yields the value 'aaaaa' or the string "aaaaa", or this is
    some kind of encryption algorithm.

    So now they are treated like integers, other than printing an array of
    char or pointer to char assumes they are strings.

    Prohibiting arithmetic on
    them could be done but would make classifying and manipulating
    characters difficult unless one had a comprehensive set of library
    functions such as

      is_digit(char)
      is_alphanum(locale, char)
      is_lower(locale, char)
      upper(locale, char)

    and many more.

    As I said, you and I don't know all the possibilities. Of course there
    would need to be conversions between char and int, but this can become a
    nuisance.

  • From James Harris@21:1/5 to David Brown on Sat Nov 5 11:14:01 2022
    On 04/11/2022 13:58, David Brown wrote:
    On 03/11/2022 16:48, James Harris wrote:
    On 31/10/2022 16:58, David Brown wrote:

    ..

    I would also recommend treating characters and character strings as
    something very different from raw bytes and binary blobs.  Users want
    to do very different things with them, and many of the useful
    operations are completely different.  Some languages have made the
    mistake of conflating the two concepts - it's difficult to fix once
    that design flaw is set into a language.

    That sounds interesting but I cannot tell what you have in mind.


    I mean you should consider a "string" to be a way of holding a sequence
    of "character" units which can hold a code unit of UTF-8 (since any
    other choice of character encoding is madness).

    We've discussed before that (IMO) Unicode is useful for physical
    printing to paper or electronic rendering such as to PDF but that it's a nightmare for programmers and users when it is used for any kind of
    input so I won't go over that again except to say that AISI Unicode
    should be handled by library functions rather than a language.

    What I do have in mind is strings of 'containers' where a string might
    be declared as of type

    string of char8 -- meaning a string of char8 containers
    string of char32 -- meaning a string of char32 containers

    What goes in each 8-bit or 32-bit 'container' would be another matter.

    That agnostic ideal is somewhat in tension with the desire to include
    string literals in a program text. For that, as I've mentioned before,
    my preference is to have the program text and any literals within it
    written in ASCII and American English; supplementary files would express
    the string literals in other languages.

    For example,

    print "Hello world"

    would be accompanied by a file for French which included

    "Hello world" --> "Bonjour le monde"

    Naturally, multilingual programming is much more complex than that
    simple example but it shows the basic idea. The compiler would be able
    to check that language files had everything required for a given piece
    of source code.
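
    A minimal sketch of the idea in Python. The file format is simply the
    arrow notation above, and the function names are invented for the
    example; a real scheme would need much more.

```python
def load_translations(lines):
    """Parse lines of the form  "source" --> "target"  into a dict."""
    table = {}
    for line in lines:
        if "-->" not in line:
            continue
        source, target = line.split("-->", 1)
        table[source.strip().strip('"')] = target.strip().strip('"')
    return table


def missing(literals, table):
    """The check a compiler could make: literals with no translation."""
    return [lit for lit in literals if lit not in table]


french = load_translations(['"Hello world" --> "Bonjour le monde"'])
program_literals = ["Hello world", "Goodbye"]     # as collected from source

print(french["Hello world"])                # Bonjour le monde
print(missing(program_literals, french))    # ['Goodbye']
```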

    ..

    Neither "byte" nor "character" should have any kind of arithmetic
    operators - they are not integers.  But you will need cast or conversion operations on them.

    The char8 and char32 containers could omit support for arithmetic **if**
    enough support routines were provided. But, as Bart says, it's difficult
    to anticipate all such support routines that programmers might need.


    The concept of "signed char" and "unsigned char" in C is a serious
    design flaw.  A type designed to hold letters should not have a sign,
    and should not be used to hold arbitrary raw, low-level data.

    OK.

    You might also consider not having a character type at all.  Python 3
    has no character types - "a" is a string, not a character.

    I can see that as being possible. I keep coming across examples of where
    a 1-character string would do just as well as a character. ATM, though,
    I have a separate character type.

    ..

    Raw binary buffers require nothing more than an address and a size to describe them - anything more, and it is too high level.  (Again,
    there's nothing wrong with providing higher level features and
    interfaces, but they have to build on the fundamental ones.)

    Maybe that's where I am going with this. What you describe does sound
    rather like my idea of using chars as containers and putting char and
    string handling in libraries.

    Incidentally, AISI I could have strings of any data type, not just
    chars. For example,

    string of int16
    string of struct {x: float, y: float}
    string of function(int, bool) -> uint

    I'll have to see how it works out in practice but the idea is to
    separate the concept of a string (basically, storage layout) from the
    concept of whatever type of element the string is made from.


    --
    James Harris

  • From James Harris@21:1/5 to Bart on Sat Nov 5 10:36:33 2022
    On 04/11/2022 22:28, Bart wrote:
    On 04/11/2022 21:35, James Harris wrote:
    On 04/11/2022 17:50, Bart wrote:
    On 04/11/2022 13:58, David Brown wrote:

    Neither "byte" nor "character" should have any kind of arithmetic
    operators - they are not integers.  But you will need cast or
    conversion operations on them.

    Bytes are small integers, typically of u8 type.

    I can't see why arithmetic can't be done with them, unless you want a
    purer kind of language where arithmetic is only allowed on signed
    numbers, and bitwise ops only on unsigned numbers, which is usually
    going to be a pain for all concerned.

    I think what David means is that arithmetic operations don't apply to
    characters

    I was picking on the 'byte' type; it seems extraordinary that you
    shouldn't be allowed to do arithmetic with them. If you can initialise a
    byte value with a number like this:

        byte a = 123

    then it's a number!

    (even though some languages permit such operations). For example, neither

       'a' * 5

    nor even

       'R' + 1

    have any meaning over the set of characters.

    I actually had such a restriction for a while: char*5 wasn't allowed,
    but char+1 was. After all why on earth shouldn't you want the next
    character in that alphabet?

    That's because 'R' + 1 may not be the next character in all alphabets.
    Defining 'next' is more than difficult. It depends on intended collation
    order which varies in different parts of the world and can even change
    over time as authorities choose different collation orders. Some
    plausible meanings of 'R' + 1:

    'S' (as in ASCII)
    'r' (what the user may want as sort order)
    's' (what the user may want as sort order)
    a non-character (as in EBCDIC)

    Perhaps a pseudo-call would be better such as

    char_plus(collation, 'R', 1)

    where 'collation' would be used to determine what was the specified
    number of characters away from 'R'.
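
    As a sketch only, with a collation modelled as nothing more than an
    ordered string of letters (real collation rules are far richer), that
    pseudo-call could look like this in Python:

```python
ENGLISH = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
SPANISH = "ABCDEFGHIJKLMN\u00d1OPQRSTUVWXYZ"     # U+00D1 is N with tilde


def char_plus(collation, ch, n):
    """Return the character n places after ch in the given collation."""
    i = collation.index(ch) + n      # ValueError if ch is not in the alphabet
    if not 0 <= i < len(collation):
        raise ValueError("no such character in this collation")
    return collation[i]


print(char_plus(ENGLISH, "R", 1))                # S
print(char_plus(ENGLISH, "N", 1))                # O
print(char_plus(SPANISH, "N", 1) == "\u00d1")    # True
print(chr(ord("N") + 1))                         # O, whatever the alphabet
```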

    Why should code like this be made illegal:

        a := a * 10 + (c - '0')

    I'm not saying it should. I am in two minds about what's best. The
    alternative is something like

    a := a * 10 + digit_value(c)
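
    The two forms can be compared in Python (a sketch; digit_value here is
    a thin wrapper over the standard unicodedata.decimal, chosen for the
    example). They agree on ASCII digits and differ on digits from other
    scripts.

```python
import unicodedata


def digit_value(ch):
    """Decimal value of a digit in any script; ValueError for non-digits."""
    return unicodedata.decimal(ch)


def parse_with_arithmetic(s):
    a = 0
    for c in s:
        a = a * 10 + (ord(c) - ord("0"))
    return a


def parse_with_digit_value(s):
    a = 0
    for c in s:
        a = a * 10 + digit_value(c)
    return a


arabic_indic_42 = "\u0664\u0662"      # ARABIC-INDIC DIGIT FOUR, DIGIT TWO

print(parse_with_arithmetic("42"), parse_with_digit_value("42"))   # 42 42
print(parse_with_digit_value(arabic_indic_42))                     # 42
print(parse_with_arithmetic(arabic_indic_42) == 42)                # False
```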


    Then I realised I shouldn't be telling the programmer what they can and
    can't do with characters, as there might be some perfectly valid
    use-case that I simply hadn't thought of.

    Agreed. Although the point of prohibiting arithmetic on characters is to
    make multilingual programming easier, not harder.

    ..

    Prohibiting arithmetic on them could be done but would make
    classifying and manipulating characters difficult unless one had a
    comprehensive set of library functions such as

       is_digit(char)
       is_alphanum(locale, char)
       is_lower(locale, char)
       upper(locale, char)

    and many more.

    As I said, you and I don't know all the possibilities.

    Yes, the challenge of making multilingual programming easier is
    providing a comprehensive and convenient set of conversions.


    Of course there
    would need to be conversions between char and int, but this can become a nuisance.

    Aside from converting digits (and any other characters used as digits in
    a higher number base) is there any meaning to converting chars to/from
    ints?


    --
    James Harris

  • From Bart@21:1/5 to James Harris on Sat Nov 5 13:46:33 2022
    On 05/11/2022 11:14, James Harris wrote:
    On 04/11/2022 13:58, David Brown wrote:

    I mean you should consider a "string" to be a way of holding a
    sequence of "character" units which can hold a code unit of UTF-8
    (since any other choice of character encoding is madness).

    We've discussed before that (IMO) Unicode is useful for physical
    printing to paper or electronic rendering such as to PDF but that it's a nightmare for programmers and users when it is used for any kind of
    input so I won't go over that again except to say that AISI Unicode
    should be handled by library functions rather than a language.

    What I do have in mind is strings of 'containers' where a string might
    be declared as of type

      string of char8      -- meaning a string of char8 containers
      string of char32     -- meaning a string of char32 containers

    What goes in each 8-bit or 32-bit 'container' would be another matter.

    That agnostic ideal is somewhat in tension with the desire to include
    string literals in a program text. For that, as I've mentioned before,
    my preference is to have the program text and any literals within it
    written in ASCII and American English; supplementary files would express
    the string literals in other languages.

    For example,

      print "Hello world"

    would be accompanied by a file for French which included

      "Hello world" --> "Bonjour le monde"

    Naturally, multilingual programming is much more complex than that
    simple example but it shows the basic idea. The compiler would be able
    to check that language files had everything required for a given piece
    of source code.

    Is it? This is pretty much all I did when I used to write internationalised
    applications. Although that was only done for French, German and Dutch.

    But that print example would be written like this:

    print /"Hello World"

    The "/" was a translation operator, so only certain strings were
    translated. This also made it easy to scan source code to build a list
    of messages, used to maintain the dictionary as entries were added,
    deleted or modified.

    The scheme did need some hints sometimes, written like this, to get
    around ambiguities:

    print /"Green!colour"
    print /"Green!fresh"

    The hint was usually filtered out.
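
    For illustration, the lookup with hints might be sketched in Python as
    below. The function name tr and the dictionary entries are invented
    sample data, not the original implementation.

```python
# Invented sample entries; a real dictionary would be built by scanning the
# source for translatable strings.
DICTIONARY = {
    "Hello World": "Bonjour le monde",
    "Green!colour": "Vert",
    "Green!fresh": "Frais",
}


def tr(message, dictionary=DICTIONARY):
    """Translate message; with no entry, drop any '!hint' and return the rest.

    Assumes '!' is used only to introduce a hint.
    """
    if message in dictionary:
        return dictionary[message]
    return message.split("!", 1)[0]


print(tr("Green!colour"))     # Vert
print(tr("Green!fresh"))      # Frais
print(tr("Green!unknown"))    # Green (hint filtered out, no translation)
```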

    But this is little to do with how strings are represented. Even in
    English, messages may include characters like "£" (pound sign) which is
    not part of ASCII.

    So a way to represent Unicode within literals is still needed (didn't we discuss this a couple of years ago?).

  • From Bart@21:1/5 to James Harris on Sat Nov 5 13:38:16 2022
    On 05/11/2022 10:36, James Harris wrote:
    On 04/11/2022 22:28, Bart wrote:

    I actually had such a restriction for a while: char*5 wasn't allowed,
    but char+1 was. After all why on earth shouldn't you want the next
    character in that alphabet?

    That's because 'R' + 1 may not be the next character in all alphabets. Defining 'next' is more than difficult. It depends on intended collation order which varies in different parts of the world and can even change
    over time as authorities choose different collation orders. Some
    plausible meanings of 'R' + 1:

      'S' (as in ASCII)

    You said elsewhere that you want to use ASCII within programs. Which, as
    it happens, corresponds to the first 128 points of Unicode. Here:

    char c
    for c in 'A'..'Z' do
    print c
    od

    this displays 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'. The for-loop works by adding
    +1 to 'c'; it doesn't care about collating order!

    (This also illustrates the difference between `byte` and `char`; using a
    byte type, the output would be '656667...90'.)
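
    The same contrast can be reproduced in Python, where str and bytes play
    roughly the roles of char and byte (an analogy, not the language being
    described):

```python
chars = "".join(chr(c) for c in range(ord("A"), ord("Z") + 1))
print(chars)                        # ABCDEFGHIJKLMNOPQRSTUVWXYZ

data = chars.encode("ascii")        # the same values viewed as bytes
as_numbers = "".join(str(b) for b in data)
print(as_numbers[:6] + "...")       # 656667...
print(data[0], chars[0])            # 65 A
```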



      'r' (what the user may want as sort order)
      's' (what the user may want as sort order)
      a non-character (as in EBCDIC)

    My feeling is that it is these diverse requirements that require
    user-supplied functions.

    My 'char' is still a thinly veiled numeric type, so ordinary integer
    arithmetic can be used. Otherwise even something like this becomes
    impossible:

    ['A'..'Z']int histogram

    midpoint := (histogram.upb - histogram.lwb)/2

    ++histogram[midpoint+1]

    This requires certain properties of array indices, like being able to
    do arithmetic, as well as being consecutive ordinal values.

    Perhaps a pseudo-call would be better such as

      char_plus(collation, 'R', 1)

    where 'collation' would be used to determine what was the specified
    number of characters away from 'R'.

    Sure, as I said, you can provide any interpretation you like. But if you
    do C+1, you expect to get the code of the next character (or next
    codepoint if venturing outside ASCII).



    Aside from converting digits (and any other characters used as digits in
    a higher number base) is there any meaning to converting chars to/from ints?


    My static language makes byte and char slightly different types. (Types involving char may get printed differently.)

    That meant that `ref byte` and `ref char` were incompatible, which
    rapidly turned into a nightmare: I might have a readfile() routine that returned a `ref byte` type, a pointer to a block of memory.

    But I wanted to interpret that block as `ref char` - a string. So this
    meant loads of casts to either `ref byte` or `ref char` to get things to
    work, but it got too much (a bit like 'const poisoning' in C where it
    just propagates everywhere). That was clearly the wrong approach.

    In the end I relaxed the type rules so that `ref byte` and `ref char`
    are compatible, and everything is now SO much simpler.

  • From James Harris@21:1/5 to Bart on Sat Nov 5 16:07:30 2022
    On 05/11/2022 13:38, Bart wrote:
    On 05/11/2022 10:36, James Harris wrote:
    On 04/11/2022 22:28, Bart wrote:

    I actually had such a restriction for a while: char*5 wasn't allowed,
    but char+1 was. After all why on earth shouldn't you want the next
    character in that alphabet?

    That's because 'R' + 1 may not be the next character in all alphabets.
    Defining 'next' is more than difficult. It depends on intended
    collation order which varies in different parts of the world and can
    even change over time as authorities choose different collation
    orders. Some plausible meanings of 'R' + 1:

       'S' (as in ASCII)

    You said elsewhere that you want to use ASCII within programs. Which, as
    it happens, corresponds to the first 128 points of Unicode. Here:

        char c
        for c in 'A'..'Z' do
            print c
        od

    this displays 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'. The for-loop works by adding
    +1 to 'c'; it doesn't care about collating order!

    That piece of code is fine for an English-speaking user but to a
    Spaniard the alphabet has a character missing, and Greeks wouldn't agree
    with it at all.

    Where L is the locale why not allow something like

    for c in L.alpha_first..L.alpha_last
    print c
    od

    ?

    That should work in English or Spanish or Greek etc, shouldn't it?


    (This also illustrates the difference between `byte` and `char`; using a
    byte type, the output would be '656667...90'.)



       'r' (what the user may want as sort order)
       's' (what the user may want as sort order)
       a non-character (as in EBCDIC)

    My feeling is that it is these diverse requirements that require user-supplied functions.

    Functions, yes, (or what appear to be functions) though surely they
    should be part of a library that comes with the language.


    My 'char' is still a thinly veiled numeric type, so ordinary integer
    arithmetic can be used. Otherwise even something like this becomes
    impossible:

        ['A'..'Z']int histogram

        midpoint := (histogram.upb - histogram.lwb)/2

        ++histogram[midpoint+1]

    This requires certain properties of array indices, like being able to
    do arithmetic, as well as being consecutive ordinal values.

    Why not

    [L.alpha_first..L.alpha_last] int histogram

    ?

    As for the calculations what about using L.ord and L.chr to convert
    between chars and integers?
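
    A sketch of what I mean, in Python, with a locale reduced to an ordered
    alphabet. The class and its members (alpha_first, alpha_last, ord, chr)
    mirror the names used above but are otherwise hypothetical.

```python
class Locale:
    """A locale modelled as nothing more than an ordered alphabet."""

    def __init__(self, alphabet):
        self.alphabet = alphabet
        self.alpha_first = alphabet[0]
        self.alpha_last = alphabet[-1]

    def ord(self, ch):
        return self.alphabet.index(ch)     # position within this alphabet

    def chr(self, n):
        return self.alphabet[n]


L = Locale("ABCDEFGHIJKLMN\u00d1OPQRSTUVWXYZ")    # Spanish, 27 letters

# One counter per letter from alpha_first to alpha_last.
histogram = [0] * (L.ord(L.alpha_last) - L.ord(L.alpha_first) + 1)
for ch in "MA\u00d1ANA":
    histogram[L.ord(ch)] += 1

midpoint = len(histogram) // 2
print(len(histogram))              # 27
print(histogram[L.ord("A")])       # 3
print(L.chr(midpoint))             # N
```

    The arithmetic is done on integers obtained through L.ord, so the code
    never depends on the code points of the characters themselves.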


    Perhaps a pseudo-call would be better such as

       char_plus(collation, 'R', 1)

    where 'collation' would be used to determine what was the specified
    number of characters away from 'R'.

    Sure, as I said, you can provide any interpretation you like. But if you
    do C+1, you expect to get the code of the next character (or next
    codepoint if venturing outside ASCII).

    If you use codepoints then you might not get the next character in
    sequence - as in the case of 'R' in EBCDIC (you'd get a non-printing
    character) or 'N' in Spanish (you'd get 'O' rather than the N with a hat
    that a Spaniard would expect).

    If the programmer wants "the next character in the alphabet" then
    shouldn't the programming language or a standard library help him get
    that irrespective of the human language the program is meant to be
    processing?




    Aside from converting digits (and any other characters used as digits
    in a higher number base) is there any meaning to converting chars
    to/from ints?


    My static language makes byte and char slightly different types. (Types involving char may get printed differently.)

    That meant that `ref byte` and `ref char` were incompatible, which
    rapidly turned into a nightmare: I might have a readfile() routine that returned a `ref byte` type, a pointer to a block of memory.

    But I wanted to interpret that block as `ref char` - a string. So this
    meant loads of casts to either `ref byte` or `ref char` to get things to work, but it got too much (a bit like 'const poisoning' in C where it
    just propagates everywhere). That was clearly the wrong approach.

    C's const propagation sounds like Java with its horrible, and sticky,
    exception propagation.


    In the end I relaxed the type rules so that `ref byte` and `ref char`
    are compatible, and everything is now SO much simpler.

    Would there have been any value in defining a layout for the untyped
    area of bytes (or parts thereof)? That's where I think I am headed.


    --
    James Harris

  • From David Brown@21:1/5 to James Harris on Sat Nov 5 17:51:01 2022
    On 04/11/2022 22:35, James Harris wrote:
    On 04/11/2022 17:50, Bart wrote:
    On 04/11/2022 13:58, David Brown wrote:

    Neither "byte" nor "character" should have any kind of arithmetic
    operators - they are not integers.  But you will need cast or
    conversion operations on them.

    Bytes are small integers, typically of u8 type.

    I can't see why arithmetic can't be done with them, unless you want a
    purer kind of language where arithmetic is only allowed on signed
    numbers, and bitwise ops only on unsigned numbers, which is usually
    going to be a pain for all concerned.

    I think what David means is that arithmetic operations don't apply to characters (even though some languages permit such operations). For
    example, neither

      'a' * 5

    nor even

      'R' + 1

    have any meaning over the set of characters. Prohibiting arithmetic on
    them could be done but would make classifying and manipulating
    characters difficult unless one had a comprehensive set of library
    functions such as

      is_digit(char)
      is_alphanum(locale, char)
      is_lower(locale, char)
      upper(locale, char)

    and many more.



    Yes.

    You will, of course, need some kind of explicit conversion between strings/characters and integers or bytes. But if you want operations
    like "is_digit" or "upper" for more than plain ASCII, you need a
    comprehensive library. (As a fine example, the uppercase of "i" in
    English is the letter "I", while in Turkish it is the letter "İ".)

    There are plenty of internationalisation libraries available - you
    should be able to find something suitable (perhaps
    <https://icu.unicode.org>) and make wrappers and interfaces to your new language.
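
    To show the Turkish case concretely, here is a small Python sketch.
    Python's built-in str.upper() does not take a locale, so the Turkish
    rule is special-cased by hand; the function name and the "tr" tag are
    assumptions for the example, and a real program would lean on a proper
    internationalisation library instead.

```python
def upper(locale, s):
    """Uppercase s; 'tr' (Turkish) gets the dotted/dotless i rule first."""
    if locale == "tr":
        # i -> U+0130 (capital I with dot above); U+0131 (dotless i) -> I
        s = s.replace("i", "\u0130").replace("\u0131", "I")
    return s.upper()


print(upper("en", "i"))                  # I
print(upper("tr", "i") == "\u0130")      # True
print(upper("tr", "\u0131"))             # I
```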

  • From David Brown@21:1/5 to Bart on Sat Nov 5 17:41:59 2022
    On 04/11/2022 18:50, Bart wrote:
    On 04/11/2022 13:58, David Brown wrote:

    Neither "byte" nor "character" should have any kind of arithmetic
    operators - they are not integers.  But you will need cast or
    conversion operations on them.

    Bytes are small integers, typically of u8 type.

    That's true in some languages. In other languages they are character
    types. (In C, they are both.)

    My suggestion is that they should not be considered integers or
    characters, but a low-level "raw data" type. A "byte" should represent
    a byte of memory, or an octet in a network packet, or a byte of a file.
    It doesn't make sense to do any arithmetic on it because it might be
    part of some completely different data type.

    A given "byte" might be part of the storage of a floating point number,
    or the storage of an address or pointer, or anything else.
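
    A short Python illustration of why arithmetic on such a byte says
    nothing useful about the value it belongs to (a sketch using the
    standard struct module and a little-endian 32-bit float):

```python
import struct

raw = bytearray(struct.pack("<f", 1.0))    # the four bytes of a 32-bit float
print(raw.hex())                           # 0000803f

raw[3] += 1                                # "add 1" to one byte of it
(value,) = struct.unpack("<f", raw)
print(value)                               # 4.0
```

    Adding 1 to the top byte bumped the exponent field, so 1.0 became 4.0;
    the operation was meaningful only once the bytes were reinterpreted as
    the type they really represent.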

    I've said this before - the strength of a programming language is mainly determined by what you /cannot/ do, rather than what you /can/ do.
    Design a language to make it as hard as possible to get things wrong,
    while still being easy to do the things you want it to do. Keeping
    integer types, character types, strings, and raw data independent can
    help with that.


    I can't see why arithmetic can't be done with them, unless you want a
    purer kind of language where arithmetic is only allowed on signed
    numbers, and bitwise ops only on unsigned numbers, which is usually
    going to be a pain for all concerned.


    Whether you have signed and unsigned integer types, what sizes you have, whether you have an abstract "integer" type and explicitly ranged
    subtypes (as in Ada) is another matter.

    The concept of "signed char" and "unsigned char" in C is a serious
    design flaw.  A type designed to hold letters should not have a sign,
    and should not be used to hold arbitrary raw, low-level data.

    Signed and unsigned chars are not so bad;

    Yes, they are bad. A "character" is a letter or other visual symbol.
    Can you explain the difference between "a positive letter X" and "a
    negative letter X" ? Of course you can't - it is utter nonsense. The
    same goes for adding 6 to the Greek letter µ, or multiplying Ð by Å.
    Even operations that might appear sensible in code, such as adding 1 to
    a char, don't actually make sense - it is the operation of taking the
    next letter in the alphabet that makes sense.

    presumably C intended these to
    do the job of a 'byte' type for small integers. So it was just a poor
    choice of name. (After all there is no separate type in C for bytes
    holding character data.)

    What's bad is that third kind: a 'plain char' type, which is
    incompatible with both signed and unsigned char, even though it
    necessarily needs to be one or the other on a specific platform. It
    occurs in no other language, and causes problems within FFI APIs.


    Certainly having three different "char" types in C is bad. But the only sensible choice is to have nothing but a plain "char" and use /integer/
    types for numeric data. (Call them "u8" and "i8" if you prefer, rather
    than the C names "uint8_t" and "int8_t".)

    (These days I would consider not having any kind of character type at
    all, unless it was a language targeting small embedded systems that need maximal efficiency - by the time you have something that can hold any
    UTF-8 character, you might as well just call it a "string".)

  • From James Harris@21:1/5 to Bart on Sat Nov 5 16:18:49 2022
    On 05/11/2022 13:46, Bart wrote:
    On 05/11/2022 11:14, James Harris wrote:

    ..

    For example,

       print "Hello world"

    would be accompanied by a file for French which included

       "Hello world" --> "Bonjour le monde"

    Naturally, multilingual programming is much more complex than that
    simple example but it shows the basic idea. The compiler would be able
    to check that language files had everything required for a given piece
    of source code.
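
    A minimal sketch of that completeness check, in Python rather than the
    language under discussion. The names SOURCE_MESSAGES, FRENCH,
    missing_translations and translate are hypothetical, and a real compiler
    would gather the source strings itself instead of having them listed
    by hand.

```python
# Strings that the source code passes to 'print' (assumed to have been
# collected by the compiler; listed by hand here).
SOURCE_MESSAGES = {"Hello world", "Goodbye"}

# One translation table per language file. "Goodbye" is deliberately
# missing so that the check below has something to report.
FRENCH = {
    "Hello world": "Bonjour le monde",
}

def missing_translations(source_messages, table):
    """Return the source messages that have no entry in the table."""
    return sorted(m for m in source_messages if m not in table)

def translate(message, table):
    """Look a message up, falling back to the source text if absent."""
    return table.get(message, message)

print(translate("Hello world", FRENCH))                # Bonjour le monde
print(missing_translations(SOURCE_MESSAGES, FRENCH))   # ['Goodbye']
```

    Treating a non-empty result from missing_translations as a build error
    is one way to get the "compiler checks the language files" behaviour.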

    Is it? This is pretty much all I did when I used to write internationalised
    applications. Although that was only done for French, German and Dutch.

    I imagine multilingual programming would be very difficult and that it's something a language should help with but it's not something I have had
    to do as yet.


    But that print example would be written like this:

        print /"Hello World"

    The "/" was a translation operator, so only certain strings were
    translated. This also made it easy to scan source code to build a list
    of messages, used to maintain the dictionary as entries were added,
    deleted or modified.

    The scheme did need some hints sometimes, written like this, to get
    around ambiguities:

        print /"Green!colour"
        print /"Green!fresh"

    That's cool. I have something similar which is trailing identifiers but
    I had them down as specific rather than hints. For example,

    print "Green" :GreenColor
    print "Green" :GreenFresh

    Then the accompanying language files could distinguish between the strings.
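
    Both schemes amount to looking a message up by (text, tag) rather than
    by text alone. A rough Python sketch follows, using the "Green!colour"
    spelling; split_hint, translate and the GERMAN table are hypothetical,
    and the table entries are only illustrative.

```python
def split_hint(message):
    """Split 'Green!colour' into ('Green', 'colour').

    The hint is None when the message has no '!'. A message that needs a
    literal '!' would require some escaping rule, which is not handled here.
    """
    text, sep, hint = message.partition("!")
    return text, (hint if sep else None)

# Language file keyed by (source text, hint).
GERMAN = {
    ("Green", "colour"): "Grün",
    ("Green", "fresh"): "Frisch",
}

def translate(message, table):
    """Translate a message; fall back to the text with the hint filtered out."""
    text, hint = split_hint(message)
    return table.get((text, hint), text)

print(translate("Green!colour", GERMAN))   # Grün
print(translate("Green!fresh", GERMAN))    # Frisch
print(translate("Green!other", GERMAN))    # Green
```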


    The hint was usually filtered out.

    But this is little to do with how strings are represented. Even in
    English, messages may include characters like "£" (pound sign) which is
    not part of ASCII.

    I have two ways to deal with that. If the program is to use a pound sign
    in England but a Dollar sign in America, say, then the source would have
    a dollar sign and there'd be a file to translate for use in England.

    On the other hand, if the program was to use a pound sign in all cases
    then I'd do as below.


    So a way to represent Unicode within literals is still needed (didn't we discuss this a couple of years ago?).


    Yes. You may remember my preference was for a named character something like

    pound_currency_string = "\PoundSterling/"
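
    For comparison, Python already has named characters in literals, using
    the official Unicode character names rather than a name such as
    PoundSterling. A short sketch:

```python
import unicodedata

# Named escape inside a string literal.
pound_currency_string = "\N{POUND SIGN}"

# The same character looked up by name at run time.
same_char = unicodedata.lookup("POUND SIGN")

print(pound_currency_string)                     # £
print(unicodedata.name(pound_currency_string))   # POUND SIGN
print(pound_currency_string == same_char)        # True
```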


    --
    James Harris

  • From David Brown@21:1/5 to Bart on Sat Nov 5 18:01:17 2022
    On 04/11/2022 23:28, Bart wrote:
    On 04/11/2022 21:35, James Harris wrote:
    On 04/11/2022 17:50, Bart wrote:
    On 04/11/2022 13:58, David Brown wrote:

    Neither "byte" nor "character" should have any kind of arithmetic
    operators - they are not integers.  But you will need cast or
    conversion operations on them.

    Bytes are small integers, typically of u8 type.

    I can't see why arithmetic can't be done with them, unless you want a
    purer kind of language where arithmetic is only allowed on signed
    numbers, and bitwise ops only on unsigned numbers, which is usually
    going to be a pain for all concerned.

    I think what David means is that arithmetic operations don't apply to
    characters

    I was picking on the 'byte' type; it seems extraordinary that you
    shouldn't be allowed to do arithmetic with them. If you can initialise a
    byte value with a number like this:

        byte a = 123

    then it's a number!


    If I were making the language, then you could not do such an
    initialisation. A "byte" is raw data, not a number.

    (even though some languages permit such operations). For example, neither

       'a' * 5

    nor even

       'R' + 1

    have any meaning over the set of characters.

    I actually had such a restriction for a while: char*5 wasn't allowed,
    but char+1 was. After all why on earth shouldn't you want the next
    character in that alphabet? Why should code like this be made illegal:

        a := a * 10 + (c - '0')

    Why not :

    char a; // Use whatever syntax you prefer
    int i; // and whatever type names you prefer

    a = digit(i);

    The function "digit" might be defined :

    char digit(int i) {
    return char(i + ord('0'));
    }

    You want to find the next letter after "x"? "char(ord(x) + 1)". Or
    perhaps, like Pascal and Ada, "succ(x)".

    A language has to let you do what you need to do - but you should be
    required to write it /clearly/. It should not let you mix apples and
    oranges without you saying exactly how you think apples and oranges
    should be mixed in the given situation.
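
    The same idea can be sketched in Python: a character type that has no
    arithmetic operators, with explicit conversions in both directions. The
    Char class and the helpers char_ord and char_from are hypothetical names
    for this illustration, and the code point is assumed to be the
    underlying representation.

```python
class Char:
    """A single character with no arithmetic operators of its own."""
    __slots__ = ("_cp",)

    def __init__(self, text):
        if len(text) != 1:
            raise ValueError("Char needs exactly one character")
        self._cp = ord(text)

    def __eq__(self, other):
        return isinstance(other, Char) and self._cp == other._cp

    def __hash__(self):
        return hash(self._cp)

    def __str__(self):
        return chr(self._cp)

def char_ord(c):
    """Explicit Char -> int conversion (the code point)."""
    return c._cp

def char_from(code_point):
    """Explicit int -> Char conversion."""
    return Char(chr(code_point))

def digit(i):
    """Return the character for a single decimal digit value."""
    if not 0 <= i <= 9:
        raise ValueError("digit value out of range")
    return char_from(i + char_ord(Char("0")))

def succ(c):
    """Return the character with the next code point."""
    return char_from(char_ord(c) + 1)

print(digit(7))          # 7
print(succ(Char("x")))   # y
# Char("a") + 1 raises TypeError: the mixing has to be spelled out.
```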


    Then I realised I shouldn't be telling the programmer what they can and
    can't do with characters, as there might be some perfectly valid
    use-case that I simply hadn't thought of.

    Maybe 'a' * 5 yields the value 'aaaaa' or the string "aaaaa", or this is
    some kind of encryption algorithm.

    You've hit the nail on the head. What does it mean to write "'a' * 5" ?
    To some people it means one thing, to others it means something
    different, and to most people in most circumstances it is meaningless.
    Meaningless code should be a compile-time error. And you let the
    programmer write /explicit/ code to say what he/she intends in other
    cases.

    It is only if something is being used so often that explicit code is a
    pain to read and write, that you should do anything implicitly here. So
    if you are writing a language designed primarily for string processing,
    you might consider having "'a' * 5" defined to be "aaaaa". Otherwise,
    let the programmer write "repeat('a', 5)" or "'a'.repeat(5)", or
    whatever suits the style of the language.


    So now they are treated like integers, other than printing an array of
    char or pointer to char assumes they are strings.

    Prohibiting arithmetic on them could be done but would make
    classifying and manipulating characters difficult unless one had a
    comprehensive set of library functions such as

       is_digit(char)
       is_alphanum(locale, char)
       is_lower(locale, char)
       upper(locale, char)

    and many more.
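
    A few such functions can be sketched in Python on top of the Unicode
    character database. This version ignores the locale argument in the
    signatures above, so it is only an approximation: rules such as the
    Turkish dotted and dotless i need a locale, which Python's str.upper
    does not take.

```python
import unicodedata

def is_digit(ch):
    """True for any Unicode decimal digit (general category Nd)."""
    return unicodedata.category(ch) == "Nd"

def is_lower(ch):
    """True for a lower-case letter (general category Ll)."""
    return unicodedata.category(ch) == "Ll"

def is_alphanum(ch):
    """True for any letter (L*) or number (N*) category."""
    return unicodedata.category(ch)[0] in ("L", "N")

def upper(ch):
    """Upper-case mapping; the result may be longer than one character."""
    return ch.upper()

print(is_digit("7"), is_digit("\u0663"))   # True True  (U+0663 is an Arabic-Indic digit)
print(is_lower("\u00e9"))                  # True       (e with acute)
print(upper("\u00df"))                     # SS         (sharp s has no single upper-case form here)
```

    The last line shows one reason a character-returning upper() is awkward:
    the natural result type is a string.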

    As I said, you and I don't know all the possibilities. Of course there
    would need to be conversions between char and int, but this can become a
    nuisance.


  • From David Brown@21:1/5 to James Harris on Sat Nov 5 18:08:37 2022
    On 05/11/2022 17:07, James Harris wrote:
    On 05/11/2022 13:38, Bart wrote:
    On 05/11/2022 10:36, James Harris wrote:
    On 04/11/2022 22:28, Bart wrote:

    I actually had such a restriction for a while: char*5 wasn't
    allowed, but char+1 was. After all why on earth shouldn't you want
    the next character in that alphabet?

    That's because 'R' + 1 may not be the next character in all
    alphabets. Defining 'next' is more than difficult. It depends on
    intended collation order which varies in different parts of the world
    and can even change over time as authorities choose different
    collation orders. Some plausible meanings of 'R' + 1:

       'S' (as in ASCII)

    You said elsewhere that you want to use ASCII within programs. Which,
    as it happens, corresponds to the first 128 code points of Unicode. Here:

         char c
         for c in 'A'..'Z' do
             print c
         od

    this displays 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'. The for-loop works by
    adding +1 to 'c'; it doesn't care about collating order!

    That piece of code is fine for an English-speaking user but to a
    Spaniard the alphabet has a character missing, and Greeks wouldn't agree
    with it at all.

    Where L is the locale why not allow something like

      for c in L.alpha_first..L.alpha_last
        print c
      od

    ?

    That should work in English or Spanish or Greek etc, shouldn't it?

    Yes. Of course, it would not work so well in Chinese (where there is no concept of alphabet), and would be a challenge for Arabic (where the
    shape of letters varies enormously depending on where they come in
    words). But it is definitely a step in the right direction. And makes
    the code independent of the character encoding.
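
    A small Python sketch of the L.alpha_first..L.alpha_last idea. The
    Locale class is hypothetical and each alphabet is written out by hand as
    an assumption; a real implementation would take this data from an i18n
    library.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Locale:
    name: str
    alphabet: str   # upper-case letters in this locale's alphabetical order

    @property
    def alpha_first(self):
        return self.alphabet[0]

    @property
    def alpha_last(self):
        return self.alphabet[-1]

    def letters(self):
        """Iterate from alpha_first to alpha_last in alphabet order."""
        return iter(self.alphabet)

N_TILDE = "\N{LATIN CAPITAL LETTER N WITH TILDE}"

ENGLISH = Locale("en", "ABCDEFGHIJKLMNOPQRSTUVWXYZ")
SPANISH = Locale("es", "ABCDEFGHIJKLMN" + N_TILDE + "OPQRSTUVWXYZ")
# Greek capitals run from U+0391 to U+03A9, but U+03A2 is unassigned,
# so even here "add 1 to the code point" does not walk the alphabet.
GREEK = Locale("el", "".join(chr(cp) for cp in range(0x391, 0x3AA)
                             if cp != 0x3A2))

for loc in (ENGLISH, SPANISH, GREEK):
    print(loc.name, "".join(loc.letters()))
```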



    (This also illustrates the difference between `byte` and `char`; using
    a byte type, the output would be '656667...90'.)



       'r' (what the user may want as sort order)
       's' (what the user may want as sort order)
       a non-character (as in EBCDIC)

    My feeling is that it is these diverse requirements that require
    user-supplied functions.

    Functions, yes, (or what appear to be functions) though surely they
    should be part of a library that comes with the language.


    How much of a library you provide depends on the goals of the language.
    You don't have to provide /everything/ !


    My 'char' is still a thinly veiled numeric type, so ordinary integer
    arithmetic can be used. Otherwise even something like this becomes
    impossible:

         ['A'..'Z']int histogram

         midpoint := (histogram.upb - histogram.lwb)/2

         ++histogram[midpoint+1]

    This requires certain properties of array indices, like being able
    to do arithmetic, as well as being consecutive ordinal values.

    Why not

      [L.alpha_first..L.alpha_last] int histogram

    ?

    As for the calculations what about using L.ord and L.chr to convert
    between chars and integers?


    Perhaps a pseudo-call would be better such as

       char_plus(collation, 'R', 1)

    where 'collation' would be used to determine what was the specified
    number of characters away from 'R'.

    Sure, as I said, you can provide any interpretation you like. But if
    you do C+1, you expect to get the code of the next character (or next
    codepoint if venturing outside ASCII).

    If you use codepoints then you might not get the next character in
    sequence - as in the case of 'R' in EBCDIC (you'd get a non-printing
    character) or 'N' in Spanish (you'd get 'O' rather than the N with a hat
    that a Spaniard would expect).

    If the programmer wants "the next character in the alphabet" then
    shouldn't the programming language or a standard library help him get
    that irrespective of the human language the program is meant to be processing?
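
    Both cases are easy to demonstrate in Python. The sketch below uses the
    cp037 codec as one EBCDIC variant, and a hand-written Spanish alphabet
    as the collation; char_plus follows the pseudo-call suggested earlier,
    but its exact behaviour here is an assumption.

```python
# Adding 1 to the code of 'R' gives different results in different encodings.
ascii_r = "R".encode("ascii")[0]
ebcdic_r = "R".encode("cp037")[0]    # cp037 is one EBCDIC code page

next_ascii = bytes([ascii_r + 1]).decode("ascii")     # 'S'
next_ebcdic = bytes([ebcdic_r + 1]).decode("cp037")   # something other than 'S'

# Assumed Spanish collation, written out by hand.
SPANISH = "ABCDEFGHIJKLMN" + "\N{LATIN CAPITAL LETTER N WITH TILDE}" + "OPQRSTUVWXYZ"

def char_plus(collation, ch, n):
    """Return the character n positions after ch in the given collation."""
    pos = collation.index(ch) + n
    if not 0 <= pos < len(collation):
        raise ValueError("stepped outside the alphabet")
    return collation[pos]

print(next_ascii)                     # S
print(next_ebcdic == "S")             # False
print(chr(ord("N") + 1))              # O  (next code point)
print(char_plus(SPANISH, "N", 1))     # N with tilde (next letter for a Spaniard)
```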




    Aside from converting digits (and any other characters used as digits
    in a higher number base) is there any meaning to converting chars
    to/from ints?


    My static language makes byte and char slightly different types.
    (Types involving char may get printed differently.)

    That meant that `ref byte` and `ref char` were incompatible, which
    rapidly turned into a nightmare: I might have a readfile() routine
    that returned a `ref byte` type, a pointer to a block of memory.

    But I wanted to interpret that block as `ref char` - a string. So this
    meant loads of casts to either `ref byte` or `ref char` to get things
    to work, but it got too much (a bit like 'const poisoning' in C where
    it just propagates everywhere). That was clearly the wrong approach.

    C's const propagation sounds like Java with its horrible, and sticky, exception propagation.


    Getting "const" right is something to think long and hard about. When
    do you mean "constant", when do you mean "read-only", when do you mean
    "I promise this data will never change", "I will assume this data will
    never change", "I promise /I/ won't change this data via this
    reference", "This data will be unchanged logically but may change in
    underlying representation, such as using a cache of some sort", etc. ?
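
    Two of those meanings can be told apart with a small Python example:
    types.MappingProxyType gives a read-only /view/ ("I won't change this
    data via this reference") of data that its owner can still change, so it
    is not a "this will never change" constant. The settings dictionary is
    just an illustration.

```python
from types import MappingProxyType

settings = {"mode": "fast"}           # owned, mutable data
view = MappingProxyType(settings)     # read-only reference to the same data

try:
    view["mode"] = "slow"             # writing through the view is rejected
    wrote_through_view = True
except TypeError:
    wrote_through_view = False

settings["mode"] = "slow"             # the owner can still change the data

print(wrote_through_view)   # False
print(view["mode"])         # slow  (the view sees the owner's change)
```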

    Constness is a hugely powerful concept, and something you definitely
    want in a language. Modern language design fashion is to make things
    constant by default and require explicit indication that they can
    change. Some programming languages (pure functional programming
    languages, for example) have /only/ constant data - there is no such
    thing as variables.


    In the end I relaxed the type rules so that `ref byte` and `ref char`
    are compatible, and everything is now SO much simpler.

    Would there have been any value in defining a layout for the untyped
    area of bytes (or parts thereof)? That's where I think I am headed.



  • From James Harris@21:1/5 to David Brown on Sun Nov 6 09:28:27 2022
    On 05/11/2022 17:01, David Brown wrote:
    On 04/11/2022 23:28, Bart wrote:

    ..

    I actually had such a restriction for a while: char*5 wasn't allowed,
    but char+1 was. After all why on earth shouldn't you want the next
    character in that alphabet? Why should code like this be made illegal:

         a := a * 10 + (c - '0')

    Why not :

        char a;        // Use whatever syntax you prefer
        int i;        // and whatever type names you prefer

        a = digit(i);

    The function "digit" might be defined :

        char digit(int i) {
            return char(i + ord('0'));
        }

    Wouldn't char and ord need a locale?

    That may be the wrong term but by locale I mean a bundled set of rules (including, in this case, what the digits are and how many there are of
    them) which apply to the language and region the program is executing for.

    Maybe it is the right term. I see on Wikipedia: "In computing, a locale
    is a set of parameters that defines the user's language, region and any
    special variant preferences that the user wants to see in their user interface."

    https://en.wikipedia.org/wiki/Locale_(computer_software)


    You want to find the next letter after "x"?  "char(ord(x) + 1)".  Or perhaps, like Pascal and Ada, "succ(x)".

    pred and succ are great, and I was thinking to start a thread on how
    they might be used for different data types. But I have to point out
    that they are not enough on their own. If a user wanted the character
    ten away from the current one then he wouldn't want to code ten succ operations.


    --
    James Harris

  • From James Harris@21:1/5 to David Brown on Sun Nov 6 09:17:24 2022
    On 05/11/2022 17:08, David Brown wrote:
    On 05/11/2022 17:07, James Harris wrote:
    On 05/11/2022 13:38, Bart wrote:

    ..

    My feeling is that it is these diverse requirements that require
    user-supplied functions.

    Functions, yes, (or what appear to be functions) though surely they
    should be part of a library that comes with the language.


    How much of a library you provide depends on the goals of the language.
    You don't have to provide /everything/ !

    That's good to hear. :)

    While I am trying to make it easy to invoke functions written by other
    people ISTM that it's also right for the language to have associated
    with it a load of standard provisions - i18n support, various data
    structures, display support, maths libraries, etc for one simple reason:
    code maintenance; it's easier to maintain software which uses library
    calls one already knows than to have to learn yet another set of i18n
    calls, for example.

    ..

    C's const propagation sounds like Java with its horrible, and sticky,
    exception propagation.


    Getting "const" right is something to think long and hard about.  When
    do you mean "constant", when do you mean "read-only", when do you mean
    "I promise this data will never change", "I will assume this data will
    never change", "I promise /I/ won't change this data via this
    reference", "This data will be unchanged logically but may change in underlying representation, such as using a cache of some sort", etc. ?

    That sounds really interesting and I'd like to get in to it but this is
    not the thread. If you wanted to start a new thread on the topic I would
    reply. Suffice to say here that I don't use "const" but do have "ro" and
    "rw" as usable in various contexts which effect many of the things you
    mention but I don't know if I have covered everything a programmer might
    need.


    Constness is a hugely powerful concept, and something you definitely
    want in a language.  Modern language design fashion is to make things
    constant by default and require explicit indication that they can
    change.  Some programming languages (pure functional programming
    languages, for example) have /only/ constant data - there is no such
    thing as variables.

    I haven't gone that far but, for example, I have globals as, by default,
    read only and while parameters are read-write a function would have to
    keep the originals around if there's a chance they would be needed.

    As I say, though, such things need a thread of their own so I'll resist
    the urge to say more.


    --
    James Harris

  • From David Brown@21:1/5 to James Harris on Sun Nov 6 12:10:07 2022
    On 06/11/2022 10:17, James Harris wrote:
    On 05/11/2022 17:08, David Brown wrote:
    On 05/11/2022 17:07, James Harris wrote:
    On 05/11/2022 13:38, Bart wrote:

    ..

    My feeling is that it is these diverse requirements that require
    user-supplied functions.

    Functions, yes, (or what appear to be functions) though surely they
    should be part of a library that comes with the language.


    How much of a library you provide depends on the goals of the
    language. You don't have to provide /everything/ !

    That's good to hear. :)

    While I am trying to make it easy to invoke functions written by other
    people ISTM that it's also right for the language to have associated
    with it a load of standard provisions - i18n support, various data structures, display support, maths libraries, etc for one simple reason:
    code maintenance; it's easier to maintain software which uses library
    calls one already knows than to have to learn yet another set of i18n
    calls, for example.

    Sure. But pick one i18n library, write the FFI wrapper, and call that
    your standard library. Then users don't have to deal with third-party
    code and libraries, and you don't have to learn the intricacies of how
    to support multiple languages properly (you don't have enough lifetimes
    to learn enough to write it yourself). Everyone wins!


    ..

    C's const propagation sounds like Java with its horrible, and sticky,
    exception propagation.


    Getting "const" right is something to think long and hard about.  When
    do you mean "constant", when do you mean "read-only", when do you mean
    "I promise this data will never change", "I will assume this data will
    never change", "I promise /I/ won't change this data via this
    reference", "This data will be unchanged logically but may change in
    underlying representation, such as using a cache of some sort", etc. ?

    That sounds really interesting and I'd like to get in to it but this is
    not the thread. If you wanted to start a new thread on the topic I would
    reply. Suffice to say here that I don't use "const" but do have "ro" and
    "rw" as usable in various contexts which effect many of the things you
    mention but I don't know if I have covered everything a programmer might
    need.


    Constness is a hugely powerful concept, and something you definitely
    want in a language.  Modern language design fashion is to make
    things constant by default and require explicit indication that they
    can change.  Some programming languages (pure functional programming
    languages, for example) have /only/ constant data - there is no such
    thing as variables.

    I haven't gone that far but, for example, I have globals as, by default,
    read only and while parameters are read-write a function would have to
    keep the originals around if there's a chance they would be needed.

    As I say, though, such things need a thread of their own so I'll resist
    the urge to say more.


    Fair enough. It's a big topic, and deserves its own thread. All I will
    do here is encourage you to think hard about it, learn about it, and
    test ideas early on - if you try to add "const" to a language later, it
    will inevitably be a mess, complicated and inconsistent.

  • From David Brown@21:1/5 to James Harris on Sun Nov 6 12:23:32 2022
    On 06/11/2022 10:28, James Harris wrote:
    On 05/11/2022 17:01, David Brown wrote:
    On 04/11/2022 23:28, Bart wrote:

    ..

    I actually had such a restriction for a while: char*5 wasn't allowed,
    but char+1 was. After all why on earth shouldn't you want the next
    character in that alphabet? Why should code like this be made illegal:

         a := a * 10 + (c - '0')

    Why not :

         char a;        // Use whatever syntax you prefer
         int i;        // and whatever type names you prefer

         a = digit(i);

    The function "digit" might be defined :

         char digit(int i) {
             return char(i + ord('0'));
         }

    Wouldn't char and ord need a locale?

    That would depend on what you are trying to do. If you wanted a real multi-lingual "digit" function, then yes - and you'd return different
    Unicode characters for different languages. But it is also important to
    have some way of getting to the underlying representation of the
    characters. At a minimum, you'll need that to implement the library of functions for dealing with locales. (Maybe you want to distinguish
    between "low-level" or "library implementation" code that is allowed to
    do such things, and "user" code that is not - just as some languages
    have "safe" and "unsafe" code modes.)


    That may be the wrong term but by locale I mean a bundled set of rules (including, in this case, what the digits are and how many there are of
    them) which apply to the language and region the program is executing for.

    Maybe it is the right term. I see on Wikipedia: "In computing, a locale
    is a set of parameters that defines the user's language, region and any special variant preferences that the user wants to see in their user interface."

      https://en.wikipedia.org/wiki/Locale_(computer_software)


    You want to find the next letter after "x"?  "char(ord(x) + 1)".  Or
    perhaps, like Pascal and Ada, "succ(x)".

    pred and succ are great, and I was thinking to start a thread on how
    they might be used for different data types. But I have to point out
    that they are not enough on their own. If a user wanted the character
    ten away from the current one then he wouldn't want to code ten succ operations.


    You could give "pred" and "succ" an optional step argument.

    You will also have to think about what happens if the result doesn't
    make sense - if you step beyond the range for the type, do you throw an
    error of some sort? Should some kinds of types have wrapping succ/pred operations?
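
    Both policies are easy to sketch in Python. Here succ and pred take an
    optional step and a wrap flag; the names, the signature and the choice
    of OverflowError are assumptions for illustration, and the "type" is
    simply an ordered alphabet string.

```python
import string

LETTERS = string.ascii_uppercase   # the ordered range of the type

def succ(alphabet, ch, step=1, wrap=False):
    """Return the value 'step' positions after ch.

    With wrap=True the range is treated as circular; otherwise stepping
    past either end raises OverflowError.
    """
    pos = alphabet.index(ch) + step
    if wrap:
        return alphabet[pos % len(alphabet)]
    if not 0 <= pos < len(alphabet):
        raise OverflowError("stepped outside the range of the type")
    return alphabet[pos]

def pred(alphabet, ch, step=1, wrap=False):
    """Return the value 'step' positions before ch."""
    return succ(alphabet, ch, -step, wrap)

print(succ(LETTERS, "A", 10))           # K
print(succ(LETTERS, "Z", wrap=True))    # A
print(pred(LETTERS, "A", wrap=True))    # Z
# succ(LETTERS, "Z") raises OverflowError
```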

  • From James Harris@21:1/5 to David Brown on Sun Nov 13 20:49:22 2022
    On 06/11/2022 11:10, David Brown wrote:
    On 06/11/2022 10:17, James Harris wrote:
    On 05/11/2022 17:08, David Brown wrote:

    ...

    How much of a library you provide depends on the goals of the
    language. You don't have to provide /everything/ !

    That's good to hear. :)

    While I am trying to make it easy to invoke functions written by other
    people ISTM that it's also right for the language to have associated
    with it a load of standard provisions - i18n support, various data
    structures, display support, maths libraries, etc for one simple
    reason: code maintenance; it's easier to maintain software which uses
    library calls one already knows than to have to learn yet another set
    of i18n calls, for example.

    Sure.  But pick one i18n library, write the FFI wrapper, and call that
    your standard library.  Then users don't have to deal with third-party
    code and libraries, and you don't have to learn the intricacies of how
    to support multiple languages properly (you don't have enough lifetimes
    to learn enough to write it yourself).  Everyone wins!

    Yes, internationalisation is an enormous area. I don't have anything
    like the requisite knowledge of the world's languages and character sets
    to do the work. What I /can/ do, AISI, is to establish principles and
    restrictions intended to make multilingual programming more natural. For
    example,

    * To define the standard form of strings to have unadorned characters
    stored as separate codes from diacritics (which I gather may be called
    combining characters).

    * Diacritic codes would all have to be stored in a certain order
    relative to each other.

    * String encodings would put diacritics before the character to which
    they apply. (Though I am not sure what to do about accents which apply to
    whole words or groups of characters.)

    * The language would not support ordering of characters based on their
    internal codes. All ordering would require a locale to indicate which
    should come first.

    * String ordering would require creation of a 'comparison string'
    created according to the rules of a selected locale. The internal codes
    of the comparison string /would/ be comparable for ordering.

    * There would be a standard API which all i18n libraries would have to
    support.

    etc
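
    Some of these principles have close counterparts in Unicode, which can
    be tried from Python. Normalization form NFD stores base characters and
    combining marks as separate codes and puts the marks in a canonical
    order, although Unicode places the marks /after/ the base character
    rather than before it. The comparison_string function below is a
    hypothetical sketch of the 'comparison string' idea, with a hand-written
    Spanish alphabet assumed as the locale's rules.

```python
import unicodedata

# Assumed Spanish collation, written out by hand.
SPANISH = "ABCDEFGHIJKLMN" + "\N{LATIN CAPITAL LETTER N WITH TILDE}" + "OPQRSTUVWXYZ"

def decompose(text):
    """NFD: base characters and combining marks as separate codes."""
    return unicodedata.normalize("NFD", text)

def comparison_string(text, alphabet):
    """Map each letter to its rank in the locale's alphabet.

    The resulting tuple of integers can be compared directly. Characters
    outside the alphabet are ranked after all letters, by code point.
    """
    composed = unicodedata.normalize("NFC", text).upper()
    ranks = []
    for ch in composed:
        pos = alphabet.find(ch)
        ranks.append(pos if pos >= 0 else len(alphabet) + ord(ch))
    return tuple(ranks)

# One precomposed character becomes base letter + combining acute accent.
print(len(decompose("\u00e9")))   # 2

# "ñu" is spelled with an escape so the literal is unambiguous.
words = ["oso", "\N{LATIN SMALL LETTER N WITH TILDE}u", "nube", "nota"]
by_code_point = sorted(words)
by_locale = sorted(words, key=lambda w: comparison_string(w, SPANISH))

print(by_code_point)   # ['nota', 'nube', 'oso', 'ñu']
print(by_locale)       # ['nota', 'nube', 'ñu', 'oso']
```

    Sorting by internal codes puts the word starting with ñ last, while the
    comparison strings put it between N and O, which is the point of
    requiring a locale for ordering.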

    Whether that kind of approach is valid or not, I don't know. It's just
    my best guess at what may be required.

    BTW, emoticons are a complete nightmare. There could be any number of
    them, they may render differently on different devices and no ordering
    of them makes sense.


    --
    James Harris
