Interesting, but the performance gain is not surprising
if you use matrix-adapted primitives.
Just for comparison:
I use different internal representations for vectors and
matrices. Vectors are simple fp-number fields. Matrices
in the heap are fat data and have a header for dimension,
numeric type and reference count. Matrices can only be
referenced via matrix-values (or matrix-stack elements).
The compiler automatically generates the correct and
therefore performant type-dependent code for
<indices> } -> matrix element
<indices> }^ -> address of the element
<indices> }! -> write matrix element
This is of course not FSL syntax, but more readable IMO.
There are also slices, but that is not the topic here.
On 12/14/23 06:44, minforth wrote:...
The compiler automatically generates the correct and
therefore performant type-dependent code for
In the FSL, this can be done for statically allotted arrays and matrices
by making the "}" and "}}" words IMMEDIATE, so that they can determine the element size (plus the number of columns for a matrix) at compile time.
I don't see how the element-size-specific reference can be done for
dynamic arrays and matrices at compile-time, when the element size and
number of columns are not known.
The details of the implementation would go too far here. Essentially,
my matrices are objects with methods for reading and writing cells
(and slices). There is a matrix stack with pointers to a pre-allocated
pool of empty matrix objects in the heap at the start of the programme.
Other frequently used global matrix values are also predefined.
This allows a lot of early-binding during compilation.
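A minimal sketch of a separate stack for matrix-object pointers along the lines described above (illustrative only, not minforth's actual implementation; no overflow checking):

16 CONSTANT #MSTACK
CREATE mstack  #MSTACK CELLS ALLOT      \ holds pointers to matrix objects
VARIABLE mdepth  0 mdepth !
: >m ( a_mat -- )   mdepth @ CELLS mstack + !  1 mdepth +! ;   \ push matrix pointer
: m> ( -- a_mat )  -1 mdepth +!  mdepth @ CELLS mstack + @ ;   \ pop matrix pointer

Keeping matrix pointers on their own stack, as described, leaves the integer and fp stacks free for ordinary operands.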
minforth wrote:
The details of the implementation would go too far here. Essentially,
my matrices are objects with methods for reading and writing cells
(and slices). There is a matrix stack with pointers to a pre-allocated
pool of empty matrix objects in the heap at the start of the programme.
Other frequently used global matrix values are also predefined.
This allows a lot of early-binding during compilation.
This is also the approach I have chosen for iForth. However, there I have
no need for a matrix stack (the parameter stack is perfectly well
suited to handle the matrix/array struct pointers).
In addition, the normal integer and fp data stacks
are full enough with other things, so the mstack is also useful for clearer code.
The operators "}" and "}}" resolve the adress of the indexed element of
the specified array, and this is where the inefficiency lies. I've been tempted to make "}" and "}}" machine code primitives for these words to increase their speed in kForth. It seems unlikely though that the words
"}" and "}}" would be generally adopted as common use words, so I have
not made them built-in words.
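For context, a generic FSL-style } has to resolve the element address at run time, roughly along these lines (a simplified sketch assuming the fsl-util convention of storing the element size in the cell just below the data area; details vary between FSL versions):

: } ( a_arr n -- a_el )
    OVER [ 1 CELLS ] LITERAL - @   \ fetch the element size stored below the data
    * + ;                          \ scale the index and add it to the base address

The extra fetch and multiply on every access are exactly the overhead a size-specific primitive avoids.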
On 12/13/23 18:38, Krishna Myneni wrote:
...
The operators "}" and "}}" resolve the adress of the indexed element
of the specified array, and this is where the inefficiency lies. I've
been tempted to make "}" and "}}" machine code primitives for these
words to increase their speed in kForth. It seems unlikely though that
the words "}" and "}}" would be generally adopted as common use words,
so I have not made them built-in words.
I think the best approach forward for me, in kForth, is to include
efficient primitives in the Forth system,
FSL_ARRAY_EL_REF ( a_arr nidx -- )
FSL_MATRIX_EL_REF ( a_mat nidx midx -- )
: random-matrix-access ( -- )
0 ?DO m{{ random-row-col }} drop LOOP ;
On 12/13/23 18:38, Krishna Myneni wrote:
...
The operators "}" and "}}" resolve the adress of the indexed element
of the specified array, and this is where the inefficiency lies. I've
been tempted to make "}" and "}}" machine code primitives for these
words to increase their speed in kForth. ...
I think the best approach forward for me, in kForth, is to include
efficient primitives in the Forth system,
Looks good for existing FSL code. It might even be faster if you
introduced type-specific and thus shorter address computations
like }}F }}SF }}I and }F }SF }I.
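A minimal sketch of what such type-specific 1-D accessors could look like (illustrative only, not minforth's or kForth's actual code; the matrix variants would additionally need the number of columns):

: }F  ( a_arr n -- a_el )  FLOATS + ;    \ 1-D array of FLOATs
: }SF ( a_arr n -- a_el )  SFLOATS + ;   \ 1-D array of SFLOATs
: }I  ( a_arr n -- a_el )  CELLS + ;     \ 1-D array of single-cell integers

Each of these skips the run-time element-size fetch that a generic } must perform.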
To do that properly, I want to go to a dual-xt system and
avoid explicit STATE-dependence (which has other benefits we've
discussed in the past). That's a bigger project for later.
Krishna Myneni wrote:
To do that properly, I want to go to a dual-xt system and avoid
explicit STATE-dependence (which has other benefits we've discussed in
the past). That's a bigger project for later.
Forge ahead. Having worked with both versions (state-smart and
dual-xt), I have to say that the investment in converting a
well-running, debugged Forth system to dual-xt only pays off
if you "POSTPONE like crazy" or are in the habit of writing DSLs.
Otherwise, state-smartness can be an itch, but is not bad per se.
On 12/18/23 07:05, Krishna Myneni wrote:
...
I have pushed these changes in kForth-64 into the main branch on GitHub:
1. Make the FSL matrix addressing word }} an intrinsic word for higher efficiency.
2. Update fsl-util.4th to load the Forth source version of }} only
if it is undefined.
The test code fsl-tester.4th (in forth-src/fsl/) may be used to verify
that FSL code runs correctly with the faster version of }}.
These changes will also appear in kForth-32 and in kForth-Win32, after a short period of testing.
minforth wrote:
[..]
In addition, the normal integer and fp data stacks
are full enough with other things, so the mstack is also useful for clearer code.
It probably comes as no surprise that I find locals to be a perfect solution for that.
mhx wrote:
OTOH I don't see much advantage in using matrix locals for dealing with global FSL matrices.
I don't use the FSL for serious work but I stuck with its basic interface words. Maybe I'll change that when it comes to a 2nd version of iForth
or iSPICE.
I'd rather invest in more readability, i.e. easy transformation from the usual Matlab notation to Forth. A simple example:
Matlab: (v'Av) where A is a matrix and v' is the transposed vector of v
Forth postfix:
.. v' A m* v m* 0> IF ...
minforth wrote:
I'd rather invest in more readability i.e. easy transformation from usual
Matlab notation to Forth. A simple example:
Matlab: (v'Av) where A is a matrix and v' is the transposed vector of v
Forth postfix:
.. v' A m* v m* 0> IF ...
I have to say:
FORTH> 2 2 FLOAT MATRIX a{{
FORTH> 1e 2e 3e 4e a{{ #=> ok
FORTH> a{{ }}print
1.000000e+0000 2.000000e+0000
3.000000e+0000 4.000000e+0000 ok
FORTH> a{{ 1 =rowget a{{ a{{ 1 =colget =mat* swap =mat* }}print
3.000000e+0001 4.000000e+0001
6.600000e+0001 8.800000e+0001 ok
FORTH> a{{ }}print
1.000000e+0000 2.000000e+0000
3.000000e+0000 4.000000e+0000 ok
But I also have XOPG:
: test1 LET a = b*c-3.17e-5/TANH(w)+ABS(x): CR LET. a: ;
LET w = 1.e-3: LET x = -2.5: CR CR test1 -- 2.4682999894333340377777106878373968247191492
0-VALUE GVAL a 0-VALUE GVAL b 0-VALUE GVAL c
0-VALUE GVAL disc ( Used for discriminant )
: quadraticroot ( F: a b c -- y1 y2 )
TO c TO b TO a                           \ Pick up coefficients.
LET disc=SQRT(b*b-4*a*c):                \ Set discriminant.
LET ((-b+disc)/(2*a),(-b-disc)/(2*a)):   \ Put values on f-stack.
;
( x*x-3*x+2=0 ) LET quadraticroot(1,-3, 2):
CR .PRINT CR .PRINT -- prints 1 2
( goldenratio ) CR LET. MAX(quadraticroot(1,-1,-1)): -- 1.6180339887498948482045868343656381177203
#[ 1 2 ; 3 4 ] :=> A ok    \ parse matrix elements, create A and fill it
A #.                       \ read A and print: [ 1. 2. ...
A' #.                      \ read A, transpose (but don't change A) and print: [ 1. 3. ...
But, I want }} to do the size-specific compilation,
eventually. To do that properly, I want to go to a dual-xt system and
avoid explicit STATE-dependence (which has other benefits we've
discussed in the past).
Krishna Myneni <krishna.myneni@ccreweb.org> writes:
But, I want }} to do the size-specific compilation,
eventually. To do that properly, I want to go to a dual-xt system and
avoid explicit STATE-dependence (which has other benefits we've
discussed in the past).
Avoiding STATE is a good idea. However, what you have in mind seems
to be an optimization, not something like S". Dual-xt words are good
for stuff like S"; you can use them for optimization, but the
intelligent COMPILE, is better for that.
mhx wrote:
0-VALUE GVAL a 0-VALUE GVAL b 0-VALUE GVAL c
0-VALUE GVAL disc ( Used for discriminant )
: quadraticroot ( F: a b c -- y1 y2 )
TO c TO b TO a Pickup coefficients.
LET disc=SQRT(b*b-4*a*c): Set discriminant.
LET ((-b+disc)/(2*a),(-b-disc)/(2*a)): Put values on f-stack.
;
If I had your formula parser that would be here perhaps
: QUADRATICROOT { f: a b c | d == y1 y2 }
2.0 *to a
let d = sqrt(fma(b,b,-2*a*c))
let y1 = (-b+d)/a
let y2 = (-b-d)/a ;
Krishna Myneni wrote:
On 12/19/23 10:43, Anton Ertl wrote:
Krishna Myneni <krishna.myneni@ccreweb.org> writes:
But, I want }} to do the size-specific compilation,
eventually. To do that properly, I want to go to a dual-xt system and
avoid explicit STATE-dependence (which has other benefits we've
discussed in the past).
Avoiding STATE is a good idea. However, what you have in mind seems
to be an optimization, not something like S". Dual-xt words are good
for stuff like S"; you can use them for optimization, but the
intelligent COMPILE, is better for that.
We had a discussion about this earlier, and I did not like the design
of SET-OPTIMIZER changing the behavior of COMPILE, . I don't see the
drawback of changing the xt for compilation state as a method of
optimization. Why add extra complexity with SET-OPTIMIZER when you
don't have to?
--
Krishna
LXF has from the beginning used the compiling XT also for optimization
(code generation). COMPILE, just compiles a call to the XT.
This works perfectly well and is easy to implement.
LXF has about 400 words that generate code while compiling.
This is my preference -- COMPILE, should simply compile the code needed
to execute the xt given to it and not do something clever by
substituting a different xt for it.
On 12/19/23 10:43, Anton Ertl wrote:
Krishna Myneni <krishna.myneni@ccreweb.org> writes:
But, I want }} to do the size-specific compilation,
eventually. To do that properly, I want to go to a dual-xt system and
avoid explicit STATE-dependence (which has other benefits we've
discussed in the past).
Avoiding STATE is a good idea. However, what you have in mind seems
to be an optimization, not something like S". Dual-xt words are good
for stuff like S"; you can use them for optimization, but the
intelligent COMPILE, is better for that.
We had a discussion about this earlier, and I did not like the design of
SET-OPTIMIZER changing the behavior of COMPILE, . I don't see the
drawback of changing the xt for compilation state as a method of
optimization. Why add extra complexity with SET-OPTIMIZER when you don't
have to?
This is my preference -- COMPILE, should simply compile the code needed
to execute the xt given to it and not do something clever by
substituting a different xt for it.
Krishna Myneni wrote:
This is my preference -- COMPILE, should simply compile the code
needed to execute the xt given to it and not do something clever by
substituting a different xt for it.
Counterexample:
Code inlining is a simple optimisation technique. Why not use it?
On 12/22/23 03:00, minforth wrote:
Krishna Myneni wrote:
This is my preference -- COMPILE, should simply compile the code
needed to execute the xt given to it and not do something clever by
substituting a different xt for it.
Counterexample:
Code inlining is a simple optimisation technique. Why not use it?
I do use code inlining. I don't understand what you mean by this being a counter-example. For example, you may need to give interpretation
semantics for a word that performs a sequence of words, but the
compilation semantics does inline compiling of the sequence.
Or am I misunderstanding you?
--
Krishna
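A hedged sketch of the pattern described here: a word whose interpretation semantics run a short sequence, while its compilation semantics inline that same sequence. It uses the Gforth-style COMPSEM: convenience word shown later in this thread (other systems spell this differently), and 2DUP* is purely an illustrative name:

: 2dup* ( a b -- a b a*b )  over over * ;          \ interpretation semantics
compsem: postpone over postpone over postpone * ;  \ compilation semantics: inline the sequence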
Krishna Myneni <krishna.myneni@ccreweb.org> writes:
This is my preference -- COMPILE, should simply compile the code needed
to execute the xt given to it and not do something clever by
substituting a different xt for it.
So IYO COMPILE, should be defined as
: compile, ( xt -- )
postpone literal postpone execute ;
?
And of course the POSTPONE EXECUTE must do something other than
"COMPILE," the xt or EXECUTE, or this results in an endless recursion.
Interestingly, development Gforth has this as "GENERAL-COMPILE," and
this is the implementation of "COMPILE," if no more specialized
version is installed. E.g., the more specialized implementation of
COMPILE, for a constant is:
: constant, ( xt -- )
>body @ postpone literal ;
This does not substitute a different xt for the given xt, it compiles
the value of the constant as a literal. Why do you think we should
compile with GENERAL-COMPILE, instead?
Krishna Myneni wrote:
I do use code inlining. I don't understand what you mean by this being
a counter-example. For example, you may need to give interpretation
semantics for a word that performs a sequence of words, but the
compilation semantics does inline compiling of the sequence.
I am a little surprised that this is a topic of discussion. See also: https://ethz.ch/content/dam/ethz/special-interest/infk/ast-dam/documents/Theodoridis-ASPLOS22-Inlining-Paper.pdf
Well, I consider compiling a single xt to be equivalent to compiling a function call. Inlining the function body instead (apart from reducing function
call overhead)
a) eliminates the xt, so it is no longer there for decompilation or introspection, and
b) creates a wider area for further optimisations.
For example, peephole optimisation can now extend across the boundaries
of the host code and inlined code, et cetera.
Krishna Myneni wrote:
This is my preference -- COMPILE, should simply compile the code needed
to execute the xt given to it and not do something clever by
substituting a different xt for it.
Quoting you:
Krishna Myneni wrote:
This is my preference -- COMPILE, should simply compile the code
needed to execute the xt given to it and not do something clever by
substituting a different xt for it.
Generally speaking, inlining substitutes the original xt with different xts (the inlined code) or machine code. But okay, just a misunderstanding.
Merry Xmas !! :-)
Krishna Myneni <krishna.myneni@ccreweb.org> writes:
On 12/19/23 10:43, Anton Ertl wrote:
Krishna Myneni <krishna.myneni@ccreweb.org> writes:
But, I want }} to do the size-specific compilation,
eventually. To do that properly, I want to go to a dual-xt system and
avoid explicit STATE-dependence (which has other benefits we've
discussed in the past).
Avoiding STATE is a good idea. However, what you have in mind seems
to be an optimization, not something like S". Dual-xt words are good
for stuff like S"; you can use them for optimization, but the
intelligent COMPILE, is better for that.
We had a discussion about this earlier, and I did not like the design of
SET-OPTIMIZER changing the behavior of COMPILE, .
If it changes the behaviour of COMPILE, (rather than the
implementation of that behaviour), that's a mistake in the use of SET-OPTIMIZER: Whatever you do, it must not change the behaviour.
I don't see the
drawback of changing the xt for compilation state as a method of
optimization. Why add extra complexity with SET-OPTIMIZER when you don't
have to?
* It's a separation of concerns: SET-OPTIMIZER is for optimization and
must not change the behaviour (i.e., if you replace the call to
SET-OPTIMIZER with DROP, the program should still work), whereas
SET->COMP (and related words such as INTERPRET/COMPILE:) changes the
compilation semantics, i.e., the behaviour.
* If you implement [COMPILE], you need to know if a word has
non-default compilation semantics. If you have the separation of
concerns above, that is easy: If it does not have the default
NAME>COMPILE method (DEFAULT-NAME>COMP), it has non-default
compilation semantics. If you are using this mechanism for a
purpose that does not change the compilation semantics, you have to
add a mechanism that tells the compiler about the difference, and
the user has to provide this information in some way, too (e.g., by
having INTERPRET/COMPILE: if the resulting word has non-default
compilation semantics and INTERPRET/OPTIMIZE: if it has).
* There is a difference in performance if an xt is COMPILE,ed; with
the intelligent COMPILE, the result is as good as going through the
text interpreter; with INTERPRET/COMPILE:, you get a generic call to
the xt, while the text interpreter produces better code.
* The COMPILE, methods get the xt that is COMPILE,d (they have the
same stack effect as COMPILE,), which helps in using the same
implementation for several xts. E.g., FOLD1-1 is the optimizer of
29 words (all with the stack effect ( x -- x )). INTERPRET/COMPILE:
lacks this flexibility. I guess you could have a way that provides
the xt or nt of the word for which you are performing the
compilation semantics; not sure how well that would work.
Finally, my vision for the (far-away) future is that words such as S"
and TO go away, and with them the need for words such as
INTERPRET/COMPILE: or (worse) STATE-smart words.
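As an aside, here is a hedged sketch of what a FOLD1-1-style constant-folding optimizer for a one-in-one-out word might look like, using the literal-stack words (LITS#, LITS>, >LITS) and OPTIMIZES that appear in the PICK example later in this thread. SQUARE is purely illustrative, and the real Gforth FOLD1-1 will differ in detail:

: square ( x -- x*x )  dup * ;
:noname ( xt -- )
    lits# 1 u>= if                        \ a literal argument is available at compile time:
        drop lits> dup * >lits exit then  \ fold it; compile nothing for SQUARE
    peephole-compile, ;                   \ otherwise compile SQUARE as usual
optimizes square

If the call to OPTIMIZES is replaced by DROP, SQUARE still behaves the same; only the generated code gets worse, which is the separation of concerns described above.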
My impression is that, in Gforth, you implement xt for a word as follows:
{ xt-interp, xt-compile [,xt-opt] }
and SET-OPTIMIZER allows you to specify xt-opt.
Perhaps some pseudo-code
for INTERPRET will be helpful in understanding how compilation occurs in
Gforth.
On 12/18/23 07:37, Krishna Myneni wrote:...
I also introduced the word "*+" in the last commit in kForth-64.
*+ ( a b c -- n ) \ n = a*b + c
Note that *+ is not the same as the sequence "* +", ...
I expect the floating point version, F*+, to be equally, if not more
useful for improving readability of fp code and increasing efficiency.
F*+ ( F: r1 r2 r3 -- r ) \ r = r1*r2 + r3
F*+ provides an intrinsic scalar linear transformation. I have not yet
added this word.
I need to correct my terminology in the prior statement. It is incorrect
to refer to the word F*+ (and its integer counterpart, *+) as a scalar "linear transformation." They allow efficient calculation of linear functions of the form f(x) = a*x + b.
A "linear transformation" is something else. A function effecting a
linear transformation must have the property f(x=0) = 0, which is most certainly not true in the general case for F*+ or *+.
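As an illustration, assuming F*+ with the stack effect given above ( F: r1 r2 r3 -- r ), r = r1*r2 + r3, a Horner-style polynomial evaluation could use it like this (a sketch, not code from kForth):

: poly ( F: x -- y )        \ evaluate 2x^2 + 3x + 5 as (2*x + 3)*x + 5
    FDUP                    \ F: x x
    2e 3e F*+               \ F: x 2x+3
    FSWAP 5e F*+ ;          \ F: (2x+3)*x + 5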
Krishna Myneni <krishna.myneni@ccreweb.org> writes:
My impression is that, in Gforth, you implement xt for a word as follows:
{ xt-interp, xt-compile [,xt-opt] }
and SET-OPTIMIZER allows you to specify xt-opt.
Actually, Gforth has 8 header methods (method selectors) and 9 setter
words for specifying the method implementation for a method selector
in the current word:
execute ( ... xt -- ... ) set-execute ( code-address -- )
execute ( ... xt -- ... ) set-does> ( xt -- )
compile, ( xt -- ) set-optimizer ( xt -- )
name>interpret ( nt -- xt ) set->int ( xt -- )
name>compile ( nt -- xt1 xt2 ) set->comp ( xt -- )
(to) ( val xt -- ) set-to ( xt -- )
defer@ ( xt1 -- xt2 ) set-defer@ ( xt -- )
name>string ( nt -- c-addr u ) set->string ( xt -- )
name>link ( nt1 -- nt2|0 ) set->link ( xt -- )
In particular, you change the compilation semantics by specifying the NAME>COMPILE method implementation, which is somewhat different from providing xt-compsem (your xt-compile). Based on that, we also
provide the convenience words:
set-compsem ( xt -- )
compsem: ( -- ) \ dedicated to Stephen Pelc
intsem: ( -- ) \ special dedication to Stephen Pelc
interpret/compile: ( xt-int xt-comp "name" -- )
Usage examples, all equivalent:
:noname ." compiling" ;
: foo ." interpreting" ; set-compsem
: foo ." interpreting" ; compsem: ." compiling" ;
: foo ." compiling" ; intsem: ." interpreting" ;
: foo-int ." interpreting" ;
: foo-comp ." compiling" ;
' foo-int ' foo-comp interpret/compile: foo
Note that several people (including me) have recommended defining,
for every dual-semantics word like FOO, also FOO-INT and FOO-COMP.
Usage:
9 interpret/compile:
1 set-compsem (in the definition of COMPSEM:)
1 compsem:
0 intsem:
23 set-optimizer (and there are other, derived words)
Read all about it in:
http://www.euroforth.org/ef19/papers/paysan.pdf
Since then, we have moved nt and xt to point to the body.
You can see the header methods (except the code field) with
.hm ( nt -- )
E.g., for seeing the code generator of "!", you say
``! .hm
Let's compare the header methods of "!", ":", and "TO" and the
constant K-F1:
``! .hm ``: .hm ``to .hm ``k-f1 .hm
opt: peephole-compile, :, :, constant,
to: no-to no-to no-to no-to
extra: $0 $0 $0 $0
int: noop noop a>int noop
comp: default-name>comp default-name>comp i/c>comp default-name>comp
string: named>string named>string named>string named>string
link: named>link named>link named>link named>link
The "extra:" field is used for SET-DOES>.
Perhaps some pseudo-code
for INTERPRET will be helpful in understanding how compilation occurs in
Gforth.
For dictionary words, what happens is, in essence:
parse-name find-name dup 0= #-13 and throw ( nt )
state @ if
name>compile
else
name>interpret
then
execute
You won't find it in this form in the current text interpreter,
because the text interpreter is now written to cover all kinds of recognizers.
What may also be interesting to you is what happens then: For words
with default interpretation and compilation semantics (most words),
the NAME>COMPILE implementation is (simplified)
: default-name>comp ( nt -- xt1 xt2 )
name>interpret ['] compile, ;
For an immediate word, the NAME>COMPILE implementation is:
: imm>comp ( nt -- xt1 xt2 )
name>interpret ['] execute ;
Not just optimization, but all code generation happens in the
implementations of COMPILE,. E.g., ":,", "CONSTANT," and
"PEEPHOLE-COMPILE," are (simplified):
: :, ( xt -- ) >body ['] call peephole-compile, , ;
: constant, ( xt -- ) >body @ postpone literal ;
: peephole-compile, ( xt -- )
\ compile xt, appending its code to the current dynamic superinstruction
lits, here swap , compile-prim1 ;
LITS, ensures that any literals on the literal stack are compiled
before the primitive (this is part of the constant folding
implementation), and COMPILE-PRIM1 ( addr -- ) is in the C part of
Gforth; in gforth-fast, it performs stack caching, combines primitives
into superinstructions, and performs native-code generation if these optimizations are enabled (they are by default), but at the very
least, turns the code-field address (ITC) into a code address (DTC).
Note that in the gforth and gforth-fast engines, "," alone does not
work for compiling a word, because these engines use hybrid
direct/indirect threaded code, which requires primitive-centric code
for colon definitions: In a colon definition, every word is compiled
to a primitive, possibly followed by an immediate argument; e.g., a
colon definition is compiled to the primitive CALL followed by the
address of the called word. See:
@InProceedings{ertl02,
author = {M. Anton Ertl},
title = {Threaded Code Variations and Optimizations (Extended
Version)},
booktitle = {Forth-Tagung 2002},
year = {2002},
address = {Garmisch-Partenkirchen},
url = {http://www.complang.tuwien.ac.at/papers/ertl02.ps.gz},
abstract = {Forth has been traditionally implemented as indirect
threaded code, where the code for non-primitives is
the code-field address of the word. To get the
maximum benefit from combining sequences of
primitives into superinstructions, the code produced
for a non-primitive should be a primitive followed
by a parameter (e.g., \code{lit} \emph{addr} for
variables). This paper takes a look at the steps
from a traditional threaded-code implementation to
superinstructions, and at the size and speed effects
of the various steps.\comment{It also compares these
variants of Gforth to various other Forth
implementations on contemporary machines.} The use
of superinstructions gives speedups of up to a
factor of 2 on large benchmarks on processors with
branch target buffers, but requires more space for
the primitives and the optimization tables, and also
a little more space for the threaded code.}
}
- anton
Anton Ertl wrote:
Not just optimization, but all code generation happens in the
implementations of COMPILE,. E.g., ":,", "CONSTANT," and
"PEEPHOLE-COMPILE," are (simplified):
: :, ( xt -- ) >body ['] call peephole-compile, , ;
: constant, ( xt -- ) >body @ postpone literal ;
: peephole-compile, ( xt -- )
\ compile xt, appending its code to the current dynamic superinstruction
lits, here swap , compile-prim1 ;
LITS, ensures that any literals on the literal stack are compiled
before the primitive (this is part of the constant folding
implementation), and COMPILE-PRIM1 ( addr -- ) is in the C part of
Gforth; in gforth-fast, it performs stack caching, combines primitives
into superinstructions, and performs native-code generation if these
optimizations are enabled (they are by default), but at the very
least, turns the code-field address (ITC) into a code address (DTC).
For constant folding, superinstructions and peephole optimisation,
my compiler has a FIFO token queue for delayed compilation.
Optimisations are based on simple pattern recognition of the queue content.
When active, this is fully automatic and does not require complicated
contortions and a bag of special words.
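A minimal sketch of the idea (illustrative only, not minforth's actual implementation): queue xts instead of compiling them immediately, and let a pattern check rewrite the tail of the queue before it is flushed. Here the only pattern recognized is SWAP SWAP, which cancels out:

8 CONSTANT #Q
CREATE q  #Q CELLS ALLOT   VARIABLE q#  0 q# !
: q-flush ( -- )  q# @ 0 ?DO  q I CELLS + @ COMPILE,  LOOP  0 q# ! ;
: q-put ( xt -- )  q# @ #Q = IF q-flush THEN  q q# @ CELLS + !  1 q# +! ;
: q-peep ( -- )                               \ drop a trailing SWAP SWAP pair
    q# @ 2 < IF EXIT THEN
    q q# @ 2 - CELLS +  DUP @  SWAP CELL+ @   ( xt1 xt2 )
    ['] SWAP =  SWAP ['] SWAP =  AND IF  -2 q# +!  THEN ;
: qcompile, ( xt -- )  q-put q-peep ;         \ queued stand-in for COMPILE,

A real implementation would also flush the queue at control-flow boundaries and at the end of a definition; that is omitted here.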
On 4/01/2024 10:33 pm, Anton Ertl wrote:
:noname ( xt -- )
lits# 1 u>= if
lits> case
0 of postpone dup drop exit endof
1 of postpone over drop exit endof
2 of postpone third drop exit endof
3 of postpone fourth drop exit endof
dup >lits
endcase
then
peephole-compile, ;
optimizes pick
What this demonstrates is how badly ANS CASE needs optimization.
PICK not so much.
On 5/01/2024 10:27 pm, Anton Ertl wrote:
dxf <dxforth@gmail.com> writes:
On 4/01/2024 10:33 pm, Anton Ertl wrote:
:noname ( xt -- )
lits# 1 u>= if
lits> case
0 of postpone dup drop exit endof
1 of postpone over drop exit endof
2 of postpone third drop exit endof
3 of postpone fourth drop exit endof
dup >lits
endcase
then
peephole-compile, ;
optimizes pick
What this demonstrates is how badly ANS CASE needs optimization.
PICK not so much.
What makes you think so?
Is it not obvious to all that ENDOF jumps in the above are compiled but
never used? I'm curious. What is it about ANS-FORTH that users and
implementers alike have become so uncritical?
Sure, this could be rewritten as:
create npicks ' dup , ' over , ' third , ' fourth ,
:noname ( xt -- )
lits# 1 u>= if
lits> dup 4 u< if
cells npicks + @ compile, exit then
>lits then
peephole-compile, ;
optimizes pick
and the result would be slightly faster when it is used. But PICK is
compiled not that often, so why optimize it? Especially given that,
if the optimization was really useful, one could just write the latter
code.
Don't make a computer do what intelligence will do anyway.
Another question in this context is which version is easier to write
correctly and easier to read and understand.
When the cases are 0 1 2 3 etc execution tables are easy to write and
overall code is minimal. But such is not often the case.
The array and matrix definitions are given in the FSL utilities source,
fsl-util.x or fsl_util.x.
The arrays are quite nice for their flexibility in making arrays out of
any type of data, and are useful in many instances. However, the source definitions are slow on non-optimizing Forth systems.
I believe the design of the arrays and matrices traces back to Julian Noble's "Scientific Forth." A 1-D array is named with a trailing left
brace "{" while 2-D matrices have a trailing double left brace, "{{".
This allows a convenient notation:
a{ I } \ resolves to address of array element I
m{{ J I }} \ resolves to address of matrix element at row J, col I
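A short usage sketch of this notation, using FSL-style defining words as they appear elsewhere in this thread (exact spellings may vary between FSL versions):

10 FLOAT ARRAY a{          \ 1-D array of 10 fp numbers
3 3 FLOAT MATRIX m{{       \ 3x3 fp matrix
2.5e a{ 4 } F!             \ store 2.5 into element 4 of a{
a{ 4 } F@ F.               \ fetch it back and print it
1.0e m{{ 2 1 }} F!         \ store 1.0 at row 2, column 1 of m{{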