Okay, no one was too excited about my previous proposal, so I am proposing something else.
For my master's thesis at the Silesian University of Technology, I am now considering building an 8-core Forth CPU for real-time control. Hard real-time control on a single CPU is difficult; it is much better to allocate one CPU to controlling each signal.
This would be a bit like the Parallax Propeller 2, and a bit different. That device uses register machines; this would be based on stack machines.
That device emulates Forth; this would have native Forth instructions.
In that device, each core has
512 longs of dual-port register RAM for code and fast variables; pairwise, cores can access some of their neighbors' registers.
512 longs of dual-port lookup RAM for code, streamer lookup, and variables;
Access to the (1M?) hub RAM every 8 clock cycles.
I would like to make it 8 proper Forth CPUs rather than register machines.
Rather than a big central hub memory, I would like each core to have more memory. How much? With two-port memories, each core could share 2 * (1/8), i.e. 1/4, of the total memory. Not bad.
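A quick sanity check of the sharing arithmetic (my reading of the proposal, not a worked-out memory map): with 8 cores and the memory split into 8 dual-port banks, wiring each core to 2 banks gives it 2/8 = 1/4 of the total.

```python
# Hypothetical numbers, just restating the fraction above.
from fractions import Fraction

cores = 8                   # one memory bank per core
banks_per_core = 2          # a dual-port bank can serve two cores at once
share = Fraction(banks_per_core, cores)
print(share)  # 1/4 of the total memory visible to each core
```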
I would like communication between adjacent cores. I like how the GA144 allows neighboring cores to communicate. That seems important to me.
I wonder if this would be of interest to anyone?
There is a good chance that I would do this in cooperation with the AI and Robotics guys and their CORE 1 CPU.
Ting's eP16/eP24/eP32 are also interesting.
There are a bunch of other cores I need to evaluate as well. Everyone speaks well of the J1.
In other news, school is going well. I am really impressed with the education here. As a software developer, I completely misunderstood how to write Verilog.
If I had tried it, it would have been a disaster. One software developer famously used nested Verilog while loops to generate a slow clock pulse. I strongly advise any developer considering designing a chip to get educated in digital design first.
Alternatively, you can do something like use the Intel design tools to lay out components and their connectivity. I am sure that there are other such tools out there. But starting with Verilog, or even VHDL, for a software developer is bound to lead to an endless stream of problems.
On Saturday, April 1, 2023 at 7:01:28 PM UTC+2, Lorem Ipsum wrote:
> It has been mentioned elsewhere, that multiple CPUs can share the same hardware by the use of pipelining.

You totally lost me on that one. I can clearly imagine 8 Forth cores. I have no idea how to turn them into a pipeline. Is there a link? My whole goal was small, fast, and simple.

> With only 8 processors, much better would be the use of shared memory. This allows the exchange of data between *any* two processors, or even between them all.

Very very interesting point. That is clearly how modern CPUs work. How EEs think. A large L1 cache. I am rather interested in the other end of the design space. Maybe by the time I complete my training, I will agree more with traditional electrical engineers. We will see.

> > https://www.youtube.com/watch?v=KXjQdKBl7ag&t=599s https://github.com/angelus9/AI-Robotics
> > There is a good chance that I would do this in cooperation with the AI and Robotics guys and their CORE 1 CPU.
> I'm not familiar with the CORE 1 CPU. Is it multiple processors?

> He wrote code like it was software, and got it running.

Great story.

> I'm not aware that anyone supports anything other than HDL.

I could not find it, but when we start using it after Easter, I will post the link.

> How do you do a diff on a schematic?

Great question.
Thank you for the engagement. If I recall correctly, you also wrote a Forth CPU; could you be so kind as to provide a link?
But still no one else is interested in many-core Forth CPUs. I am a minority in a minority.
On Saturday, April 1, 2023 at 8:41:51 PM UTC+2, Lorem Ipsum wrote:
> I'm not familiar with the CORE 1 CPU. Is it multiple processors?

The current Verilog is a single processor. They plan on going multi-processor, but, like me, they are not sure which application to target.
I do understand how to make a register machine pipelined. I have no idea how to make a stack machine pipelined.
> I do understand how to make a register machine pipelined. I have no idea how to make a stack machine pipelined.

How is it any different???
> > I do understand how to make a register machine pipelined. I have no idea how to make a stack machine pipelined.
> How is it any different???

I just thought of that as doing two things simultaneously. Maybe it is 3 things.
Fetch the instruction,
fetch the operands,
do the instruction,
write the results.
On a stack machine, the operands are already on the stack, and the result is written to the stack, so there is no opportunity to pipeline those. The only thing you could do is to fetch the next instruction at the same time, or to parse a word into multiple instructions, but that is only a two-stage pipeline, so the word pipeline did not come to mind.
Thank you for the question, it helped my understanding grow.
Still nobody I know of but me and the CORE-1 guys are interested in multi-core Forth machines. I am not sure what I am going to do next.
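The overlap described above (fetching the next instruction while executing the current one) can be sketched as a toy schedule. This is an illustrative Python model, not anyone's actual design; the instruction names are made up.

```python
# Toy two-stage fetch/execute overlap on a stack machine. Each row of
# the schedule shows what the fetch unit and the execute unit do in one
# cycle: the execute unit always works on what was fetched last cycle.
def two_stage_schedule(program):
    """Return per-cycle (fetch, execute) pairs for a 2-stage pipeline."""
    schedule = []
    fetched = None  # instruction sitting in the fetch latch
    pc = 0
    while pc < len(program) or fetched is not None:
        fetch_action = program[pc] if pc < len(program) else None
        execute_action = fetched        # execute what was fetched last cycle
        schedule.append((fetch_action, execute_action))
        fetched = fetch_action
        pc += 1
    return schedule

sched = two_stage_schedule(["DUP", "LIT 3", "+", "DROP"])
for cycle, (f, e) in enumerate(sched):
    print(f"cycle {cycle}: fetch={f!r:10} execute={e!r}")
```

Four instructions finish in five cycles instead of eight: that factor-of-nearly-two is all a two-stage pipeline buys.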
> Note that nobody has ever done a successful barrel processor design for a CPU

If I understand correctly, then XMOS has a barrel processor. Each core has multiple sets of registers, which can get swapped instantly.
But they are a tiny company.
Christopher Lozinski <caloz...@gmail.com> writes:
> > > I do understand how to make a register machine pipelined. I have no idea how to make a stack machine pipelined.
> > How is it any different???
> Fetch the instruction,
> fetch the operands,
> do the instruction,
> write the results.
> On a stack machine, the operands are already on the stack, and the result is written to the stack, so there is no opportunity to pipeline those.

The way you present it, you have just the same opportunities as for a register machine (and of course, also the costs, such as forwarding the result to the input if you want to be able to execute instructions back-to-back). And if you do it as a barrel processor, as suggested by Lorem Ipsum, AFAICS you have to do that.
I don't think that pipelining to make a barrel processor makes much
sense for you. It increases the design cost to possibly save some transistors/area compared to having that many individual processors,
but the pipelining itself also costs transistors/area, and it's not
clear that you actually save something.
Note that nobody has ever
done a successful barrel processor design for a CPU (and I only
remember the Tera MTA as an attempt to do it at all); the well-known
example of a barrel processor is the I/O processor of the CDC 6600.
Of course, if you put 8 individual cores on a chip with a single
memory interface, you somehow have to arbitrate the access to the
memory; one way to do that may be to have a single load/store unit
that gets requests from the individual cores and processes them one
after the other, somewhat like a barrel processor.
Back to the question of pipelining: If you let your stack machine run
only a single thread, you save quite a bit compared to a register
machine: the output of the ALU is one of its inputs (well, you may
want to MUX the output of the load unit (and other units) in between),
so you get the benefits of forwarding automatically. Let's assume you implement the rest of the stack as a register file (plus a stack
pointer register, maybe predecoded); then the stack architecture tells
you early which register is the other operand, and you can perform the access in parallel with the instruction fetch.
You can perform the execution of a given instruction in parallel with
the fetch of the next one, if you do the instruction fetch in a
separate pipeline stage.
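A minimal software model of the layout described here, assuming a TOS register at the ALU output and a single-ported RAM for the rest of the stack; the class and method names are my own invention, not from any of the designs in this thread.

```python
# Hedged sketch of the single-threaded layout: the TOS lives in its own
# register (the ALU output latch), the rest of the stack sits in a
# single-ported RAM. A binary op like + then needs only ONE RAM access,
# and the result lands back in TOS "for free" (built-in forwarding).
class StackCore:
    def __init__(self, depth=16):
        self.tos = 0            # ALU output latch doubles as top-of-stack
        self.ram = [0] * depth  # stack items below TOS (single-ported)
        self.sp = 0             # points at next free RAM slot
        self.ram_accesses = 0   # count RAM port uses, to see the saving

    def push(self, value):
        self.ram[self.sp] = self.tos   # one RAM write: old TOS spills down
        self.sp += 1
        self.ram_accesses += 1
        self.tos = value

    def add(self):
        self.sp -= 1
        nos = self.ram[self.sp]        # one RAM read for the second operand
        self.ram_accesses += 1
        self.tos = self.tos + nos      # result stays in the TOS latch

core = StackCore()
core.push(3)
core.push(4)
core.add()
print(core.tos)           # 4 + 3 = 7
print(core.ram_accesses)  # 3: two pushes + one read for +
```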
> The only thing you could do is to fetch the next instruction at the same time, or to parse a word into multiple instructions, but that is only a two stage pipeline, so the word pipeline did not come to mind.

A two-stage pipeline is still a pipeline.

> Still nobody I know of but me and the CORE-1 guys are interested in multi-core Forth machines.
Chuck Moore's work for quite a while has been on multi-core Forth
machines, but the interest from potential users seems to be limited;
most interest seems to be based on his earlier merits as discoverer
(as he puts it) of Forth.
On Sunday, April 2, 2023 at 5:50:41 AM UTC-4, Christopher Lozinski wrote:
> > Note that nobody has ever done a successful barrel processor design for a CPU
> If I understand correctly, then XMOS has a barrel processor. Each core has multiple sets of registers, which can get swapped instantly.

I don't know that the term "barrel processor" has any real meaning. I thought Anton was using the term for the sort of pipelined design multi-processor I was referring to. That is not the same as the XMOS thing. There, they have multiple, independent processors, which use a common memory, by interleaving accesses.
On Sunday, April 2, 2023 at 4:53:14 AM UTC-4, Anton Ertl wrote:
> Christopher Lozinski <caloz...@gmail.com> writes:
> > > > I do understand how to make a register machine pipelined. I have no idea how to make a stack machine pipelined.
> > > How is it any different???
> > Fetch the instruction,
> > fetch the operands,
> > do the instruction,
> > write the results.
> > On a stack machine, the operands are already on the stack, and the result is written to the stack, so there is no opportunity to pipeline those.
> The way you present it, you have just the same opportunities as for a register machine (and of course, also the costs, such as forwarding the result to the input if you want to be able to execute instructions back-to-back). And if you do it as a barrel processor, as suggested by Lorem Ipsum, AFAICS you have to do that.

I don't know what AFAICS means, but in a "barrel" processor, as you call it, you don't need any special additions to the design to accommodate this type of pipelining, because there is no overlap of processing instructions of a single, virtual processor. The instruction is processed 100% before beginning the next instruction. With no overlap, there's no need for "forwarding the result".

If you say that, you don't understand what is going on. The only added cost in a barrel processor are the added FFs, which are not "added" relative to multiple cores. Meanwhile, you have saved all the logic between the FFs. The amount of additional logic would be very minimal. So there would be a large savings in logic overall.

How many commercial stack processors have you seen in the last 20 years? I know of none. So why bother trying to design a stack processor?
Lorem Ipsum <gnuarm.del...@gmail.com> writes:
> On Sunday, April 2, 2023 at 5:50:41 AM UTC-4, Christopher Lozinski wrote:
> > > Note that nobody has ever done a successful barrel processor design for a CPU
> > If I understand correctly, then XMOS has a barrel processor. Each core has multiple sets of registers, which can get swapped instantly.
> I don't know that the term "barrel processor" has any real meaning. I thought Anton was using the term for the sort of pipelined design multi-processor I was referring to.

I did.

> That is not the same as the XMOS thing. There, they have multiple, independent processors, which use a common memory, by interleaving accesses.
<https://en.wikipedia.org/wiki/Barrel_processor> claims that the XCore
XS1 is a barrel processor. So either this claim is wrong, or they
have switched from a barrel processor design to one more along the
lines of what I have suggested (if he wants to go for multi-core at
all). Either variant supports my claim of lack of success for barrel processors in the CPU market.
Lorem Ipsum <gnuarm.del...@gmail.com> writes:
> On Sunday, April 2, 2023 at 4:53:14 AM UTC-4, Anton Ertl wrote:
> > Christopher Lozinski <caloz...@gmail.com> writes:
> > > > > I do understand how to make a register machine pipelined. I have no idea how to make a stack machine pipelined.
> > > > How is it any different???
> > > Fetch the instruction,
> > > fetch the operands,
> > > do the instruction,
> > > write the results.
> > > On a stack machine, the operands are already on the stack, and the result is written to the stack, so there is no opportunity to pipeline those.
> > The way you present it, you have just the same opportunities as for a register machine (and of course, also the costs, such as forwarding the result to the input if you want to be able to execute instructions back-to-back). And if you do it as a barrel processor, as suggested by Lorem Ipsum, AFAICS you have to do that.
> I don't know what AFAICS means,

As Far As I Can See.

> but in a "barrel" processor, as you call it, you don't need any special additions to the design to accommodate this type of pipelining, because there is no overlap of processing instructions of a single, virtual processor. The instruction is processed 100% before beginning the next instruction. With no overlap, there's no need for "forwarding the result".
Yes. My wording was misleading. What I meant: If you want to
implement a barrel processor with a stack architecture, you have to
treat the stack in many respects like a register file, possibly
resulting in a pipeline like above.
By contrast, for a single-thread stack-based CPU, what is the
forwarding bypass (i.e., an optimization) of a register machine is the normal path for the TOS of a stack machine; but not for a barrel
processor with a stack architecture.
> If you say that, you don't understand what is going on. The only added cost in a barrel processor are the added FFs, which are not "added" relative to multiple cores. Meanwhile, you have saved all the logic between the FFs. The amount of additional logic would be very minimal. So there would be a large savings in logic overall.
The logic added in pipelining depends on what is pipelined (over in comp.arch Mitch Alsup has explained several times how expensive a
deeply pipelined multiplier is: at some design points it's cheaper to
have two multipliers with half the pipelining that are used in
alternating cycles).
In any case, the cost is significant in
transistors, in area and in power; in the early 2000s Intel and AMD
planned to continue their clock race by even deeper pipelining than
they had until then (looking at pipelines with 8 FO4 gate equivalents
per stage), but they found that they had trouble cooling the resulting
CPUs, and so settled on ~16 FO4 gate equivalents per stage.
> How many commercial stack processors have you seen in the last 20 years? I know of none. So why bother trying to design a stack processor?
My understanding is that this is a project he does for educational
purposes. I think that he can learn something from designing a stack processor; and if that's not enough, maybe some extension or other.
He may also learn something from designing a barrel processor. But
from designing a barrel processor with a stack architecture, at best
he will learn why that squanders the implementation benefits of a
stack architecture; but without first designing a single-threaded
stack machine, I fear that he would miss that, and would not learn
much about what the difference between stack and register machines
means for the implementation, and he may also miss some interesting properties of barrel processors.
Exactly what advantage do you see from using a stack processor over register based processors, when going multi-core?
How many commercial stack processors have you seen in the last 20 years? I know of none. So why bother trying to design a stack processor?
Lorem Ipsum wrote:
> Barrel processor.

Okay, now I get it: share the ALU between cores on a time-sliced basis.
In some sense the Parallax Propeller does this. You only get access to core memory every 8 cycles.
But we can do even better than that.
Ting did an ALU where he calculated everything at once.
One could share all of that logic, and each "barrel core" could just use the ALU operation it needed.
There would be some delays, but overall huge sharing, maybe energy and space savings.
Thank you.
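The time-sliced sharing idea sketched above can be modeled in a few lines: N thread contexts (each with its own PC and stack), one shared execute unit, rotated round robin one context per clock. This is a hypothetical toy in Python, not the CORE 1 or any design from this thread; the names are invented.

```python
# Toy barrel machine: one shared ALU/execute unit, round-robin over N
# contexts. No forwarding is needed because a given context's next
# instruction only runs N cycles later, after its result has settled.
class BarrelMachine:
    def __init__(self, programs):
        self.contexts = [{"pc": 0, "stack": [], "prog": p} for p in programs]
        self.slot = 0  # which context owns the shared execute unit this cycle

    def step(self):
        ctx = self.contexts[self.slot]
        self.slot = (self.slot + 1) % len(self.contexts)  # rotate the barrel
        if ctx["pc"] >= len(ctx["prog"]):
            return  # this context has finished; its slot idles
        op, arg = ctx["prog"][ctx["pc"]]
        ctx["pc"] += 1
        if op == "LIT":
            ctx["stack"].append(arg)
        elif op == "ADD":
            b, a = ctx["stack"].pop(), ctx["stack"].pop()
            ctx["stack"].append(a + b)

progs = [[("LIT", 1), ("LIT", 2), ("ADD", None)],
         [("LIT", 10), ("LIT", 20), ("ADD", None)]]
m = BarrelMachine(progs)
for _ in range(6):  # 6 cycles: 3 instructions per context, interleaved
    m.step()
print(m.contexts[0]["stack"], m.contexts[1]["stack"])  # [3] [30]
```

The two instruction streams never interfere, which is exactly the property argued for later in the thread.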
Rick C asked:
> Exactly what advantage do you see from using a stack processor over register based processors, when going multi-core?

I believe that a stack machine is smaller than a register machine and takes up less real estate, so you can have a lot more on a single chip/FPGA. Faster clock cycles too, and less energy per useful computation.
Rick C. asks:
> How many commercial stack processors have you seen in the last 20 years? I know of none. So why bother trying to design a stack processor?
Because many small processors should be able to out perform a few big processors.
Because all the engineers keep putting more into each layer of the hardware and software stacks. And there is a huge benefit to just shrinking the entire stack and making it understandable to mere mortals. Don't optimize the pieces, optimize the entire system.
The top designs are register based, with a few stack designs in the top 10 or 20. The J1 is in the top 10.
On Sunday, April 2, 2023 at 9:03:48 AM UTC-4, Anton Ertl wrote:
> Yes. My wording was misleading. What I meant: If you want to implement a barrel processor with a stack architecture, you have to treat the stack in many respects like a register file, possibly resulting in a pipeline like above.

I'm still not following. I'm not sure what you have to do with the register file, other than to have N of them like all other logic. The stack can be implemented in block RAM. A small counter points to the stack being processed at that time. You can only perform one stack read and one write for each processor per instruction.

> By contrast, for a single-thread stack-based CPU, what is the forwarding bypass (i.e., an optimization) of a register machine is the normal path for the TOS of a stack machine; but not for a barrel processor with a stack architecture.

I guess I simply don't know what you mean by "forwarding bypass". I found this:
https://en.wikipedia.org/wiki/Operand_forwarding
But I don't follow that either. This has to do with the data of the two instructions being related. In the barrel stack processor, each phase of the processor is an independent instruction stream. Every time the stack is adjusted, the CPU would stall.

> The logic added in pipelining depends on what is pipelined (over in comp.arch Mitch Alsup has explained several times how expensive a deeply pipelined multiplier is: at some design points it's cheaper to have two multipliers with half the pipelining that are used in alternating cycles).

If you are talking about adding logic for a pipeline, that is some optimization you are performing. It's not inherent in the pipelining itself. Pipelining only requires that the logic flow be broken into steps by registers.
Lorem Ipsum <gnuarm.del...@gmail.com> writes:
> On Sunday, April 2, 2023 at 9:03:48 AM UTC-4, Anton Ertl wrote:
> > Yes. My wording was misleading. What I meant: If you want to implement a barrel processor with a stack architecture, you have to treat the stack in many respects like a register file, possibly resulting in a pipeline like above.
> I'm still not following. I'm not sure what you have to do with the register file, other than to have N of them like all other logic. The stack can be implemented in block RAM.

Like a register file.

By contrast, with a single-threaded approach, you can use the ALU output latch or the left ALU input latch as the TOS, reducing the porting requirements or increasing the performance.

> A small counter points to the stack being processed at that time. You can only perform one stack read and one write for each processor per instruction.
That means that an instruction like + would need two cycles if both
operands come from the block RAM. By contrast, with a single-threaded
stack processor you can use a single-ported SRAM block for the stack
items below the TOS, and still perform + in one cycle.
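A back-of-the-envelope restatement of this port-count argument (my sketch, not code from the thread): count the block-RAM reads a Forth `+` needs under each layout.

```python
# With the whole stack in a RAM that allows one read per cycle, the two
# operands of "+" force two cycles; keeping the TOS in a register means
# the RAM only has to supply one operand, so one cycle suffices.
def cycles_for_add(tos_in_register, ram_reads_per_cycle=1):
    """Cycles to execute '+', given where the operands live."""
    ram_operands = 1 if tos_in_register else 2  # TOS register supplies one
    # ceiling-divide the needed reads by the RAM's read ports per cycle
    return -(-ram_operands // ram_reads_per_cycle)

print(cycles_for_add(tos_in_register=True))   # 1 cycle
print(cycles_for_add(tos_in_register=False))  # 2 cycles
```

A second read port on the block RAM would also give one cycle, but at the cost of the extra port for every context of a barrel design.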
> > By contrast, for a single-thread stack-based CPU, what is the forwarding bypass (i.e., an optimization) of a register machine is the normal path for the TOS of a stack machine; but not for a barrel processor with a stack architecture.
> I guess I simply don't know what you mean by "forwarding bypass". I found this:
> https://en.wikipedia.org/wiki/Operand_forwarding
> But I don't follow that either. This has to do with the data of the two instructions being related. In the barrel stack processor, each phase of the processor is an independent instruction stream.

Yes, so you throw away the advantage that the stack architecture gives you:

For a register architecture, the barrel processor approach means that you don't need to implement the forwarding bypass.

For a single-threaded stack architecture, you don't need the data path of the TOS through the register file/SRAM block (well, not quite, you need to put the TOS in the register file when you perform an instruction that just pushes something, but the usual path is directly from the ALU output to the left ALU input). I discussed the advantages of that above. A barrel processor approach means that this advantage goes away, or at least the whole thing becomes quite a bit more complex.

> Every time the stack is adjusted, the CPU would stall.
Does not sound like a competent microarchitectural design to me.
> > The logic added in pipelining depends on what is pipelined (over in comp.arch Mitch Alsup has explained several times how expensive a deeply pipelined multiplier is: at some design points it's cheaper to have two multipliers with half the pipelining that are used in alternating cycles).
> If you are talking about adding logic for a pipeline, that is some optimization you are performing. It's not inherent in the pipelining itself. Pipelining only requires that the logic flow be broken into steps by registers.
Yes, and these registers are additional logic that costs area. In the
case of the deeply pipelined multiplier there would be so many bits
that would have to be stored in registers for some pipeline stage that
it's cheaper to have a second multiplier with half the pipelining
depth.
On Sunday, April 9, 2023 at 1:46:46 PM UTC-4, Anton Ertl wrote:
> By contrast, with a single-threaded approach, you can use the ALU output latch or the left ALU input latch as the TOS, reducing the porting requirements or increasing the performance.

Sorry, I don't know what you mean. You are describing something that is in your head, without explaining it.

The ALU does not require a register on the output. You can do that, but you also need multiplexing to allow other sources to reach the TOS register. You can try to use the ALU as your mux, but, in reality, that just moves the mux to the input of the ALU. For example, R> needs a data path from the return stack to the data stack. That can be input to a mux feeding the TOS register, or it can be input to a mux feeding an ALU input. It's a mux, either way.

> That means that an instruction like + would need two cycles if both operands come from the block RAM. By contrast, with a single-threaded stack processor you can use a single-ported SRAM block for the stack items below the TOS, and still perform + in one cycle.

I don't know what a single-threaded anything is. I don't understand your usage.

The TOS can be a separate register from the block RAM, OR you can use two ports on the block RAM. I prefer to use a TOS register, and use the two block RAM ports for read and write, because the addresses are typically different.

Sorry, that is not remotely clear to me. Using a pipeline to turn a single processor into multiple processors uses the same logic in the same way, for multiple instruction streams, with no interference.

> For a register architecture, the barrel processor approach means that you don't need to implement the forwarding bypass.

Which is not needed for the stack processor. What is your point???

Sorry, I have no idea what you are talking about. Why are you talking about TOS and register files??? Do you mean TOS and stack?

Can you reply without the garbage at the ends of lines? What is the =20 thing?

I have no idea what you are getting at. Of course pipeline registers use space on a chip. Duh! Do you have a point about this?

1) In FPGAs, the registers are typically free. They have a register with nearly every logic element.
Lorem Ipsum <gnuarm.del...@gmail.com> writes:
> On Sunday, April 9, 2023 at 1:46:46 PM UTC-4, Anton Ertl wrote:
> > By contrast, with a single-threaded approach, you can use the ALU output latch or the left ALU input latch as the TOS, reducing the porting requirements or increasing the performance.
> Sorry, I don't know what you mean. You are describing something that is in your head, without explaining it.
>
> The ALU does not require a register on the output. You can do that, but you also need multiplexing to allow other sources to reach the TOS register. You can try to use the ALU as your mux, but, in reality, that just moves the mux to the input of the ALU. For example, R> needs a data path from the return stack to the data stack. That can be input to a mux feeding the TOS register, or it can be input to a mux feeding an ALU input. It's a mux, either way.
If you really want to avoid that, you can feed all the other stuff
through the ALU on the other input, but yes, it's probably more
efficient to have a mux somewhere in the TOS->ALU->TOS loop.
But the point I was trying to make is that the TOS is not part of the register file (or "block RAM" in FPGA jargon) for the rest of the
stack, and therefore you don't need a register file with two read and
one write port per cycle (which you do for a 1-wide register machine
that should execute 1 instruction per cycle).
> > That means that an instruction like + would need two cycles if both operands come from the block RAM. By contrast, with a single-threaded stack processor you can use a single-ported SRAM block for the stack items below the TOS, and still perform + in one cycle.
> I don't know what a single-threaded anything is. I don't understand your usage.
That's a normal processor, in contrast to the multi-threaded barrel processor.
> The TOS can be a separate register from the block RAM, OR you can use two ports on the block RAM. I prefer to use a TOS register, and use the two block RAM ports for read and write, because the addresses are typically different.
Now consider how that changes for a barrel processor.
> Sorry, that is not remotely clear to me. Using a pipeline to turn a single processor into multiple processors uses the same logic in the same way, for multiple instruction streams, with no interference.

Now you have, say, 8 TOSs, 8 stack pointers, and 8 copies of the rest of the stack contents. And the 8 TOSs are on the critical path that determines the clock rate. You probably can work around that with more pipelining, but that increases the design complexity and area.
> > For a register architecture, the barrel processor approach means that you don't need to implement the forwarding bypass.
> Which is not needed for the stack processor. What is your point???

What is the forwarding bypass for a register machine is the TOS in a stack machine.
> Sorry, I have no idea what you are talking about. Why are you talking about TOS and register files??? Do you mean TOS and stack?
The stack is what the programmer sees. In the implementation you
implement the part of the stack that's not the TOS as block RAM in an
FPGA or as register file in custom hardware (plus a stack pointer).
Other options are possible, but these are the ones that are usually
used.
> Can you reply without the garbage at the ends of lines? What is the =20 thing?
That's the quoted-printable garbage that is coming from your Usenet
client. Some clients repair this garbage, but mine doesn't, so it
gets cited like you posted it. You can see in
<http://al.howardknight.net/?STYPE=msgid&MSGI=%3Cec17a8fd-b59b-4e16-b8a7-2225c6a2a9f2n%40googlegroups.com%3E>
<http://al.howardknight.net/?STYPE=msgid&MSGI=%3Cccf8d2a0-6ee4-4896-8f82-e49791c66729n%40googlegroups.com%3E>
how your last two postings have been butchered by your Usenet client.
> I have no idea what you are getting at. Of course pipeline registers use space on a chip. Duh! Do you have a point about this?

You claimed that pipelining needs no additional logic, but it does.
> 1) In FPGAs, the registers are typically free. They have a register with nearly every logic element.
For custom hardware, you have to pay extra for the registers, and they
are not cheap.
These conversations are really interesting.
The MicroCore has one CPU, but multiple stack regions in memory.
The barrel processor has multiple stacks sharing resources round robin.
The Parallax Propeller has multiple CPUs sharing central memory round robin.
The transputer has multiple CPUs, each with its own memory, and multiple tasks time-sliced.
The XMOS chip has several poorly connected CPUs, each with multiple sets of registers, sharing resources time-sliced.
I had originally thought of doing multiple small cpus.
So many choices. I can see now why you ask what is my goal. Then the best choice would be obvious.
My imagination of what is possible has been hugely stretched. My certainty of what I wanted to build has evaporated.
On Sunday, April 9, 2023 at 6:00:36 PM UTC-4, Anton Ertl wrote: [...]
The point is, the barrel processor does not require much extra logic to run N processes, without interference, resulting in a much higher processing rate.
I don't know what other people do with processors, but my designs typically need to have multiple events monitored and acted on. I don't like dealing with the potential hazards of conventional multitasking. Running independent processes on independent processors is ideal for my work. That's what a barrel processor gives me. Simple and effective.
Now consider how that changes for a barrel processor.
So how does it? The TOS is now an N-way register or small RAM, just like the rest of the stack.
Instead of asking open ended questions, why not make a statement?
Can you reply without the garbage at the ends of lines? What is the =20 thing?
That's the quoted-printable garbage that is coming from your Usenet
client. Some clients repair this garbage, but mine doesn't, so it
gets cited like you posted it. You can see in
It only shows up in your replies. I can't do anything about it. Can you?
how your last two postings have been butchered by your Usenet client.
Ok, I'll just ignore this.
The registers are there in any event. The comparison is an N-way barrel processor, or N processors. Same number of registers, but in one case, much less logic.
If you want to run a single processor, it won't run N times faster, unless you pipeline it, adding the registers back.
Lorem Ipsum <gnuarm.del...@gmail.com> writes:
On Sunday, April 9, 2023 at 6:00:36 PM UTC-4, Anton Ertl wrote: [...]
The point is, the barrel processor does not require much extra logic to run N processes, without interference, resulting in a much higher processing rate.
Where do you get the "much higher processing rate" from, especially
without "much extra logic"? If you just multiply all the stuff
implementing the stack by, say, 8, the clock rate and thus the
processing rate slows down. You need to introduce additional
pipelining (i.e., additional logic) to compensate for that. And once
you have compensated for that, each thread runs at 1/8th of the speed.
If you have memory with a latency >1 cycle, you can pipeline the
memory access with extra logic, and then you can use the
multi-threading to fill the memory access latency. In that case you
would have an increased rate of processing, if you use all 8 threads,
but each individual thread is still dog slow.
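That trade-off is easy to see in a toy cycle model (illustrative, not any real core): 8 threads round-robin over a memory with an 8-cycle load latency. By the time a thread's slot comes around again, its previous load has completed, so no slot is ever wasted on a stall, but each thread only advances on every 8th cycle:

```python
# Toy barrel-processor model: fixed round-robin over 8 threads,
# each issuing loads against a memory with 8-cycle latency.
THREADS = 8
MEM_LATENCY = 8            # cycles until a load result is available

completed = [0] * THREADS  # loads retired per thread
pending = [None] * THREADS # cycle at which each thread's load completes

for cycle in range(80):
    t = cycle % THREADS    # barrel: thread t owns every 8th cycle
    if pending[t] is not None and cycle >= pending[t]:
        completed[t] += 1  # previous load is done: retire it
        pending[t] = None
    if pending[t] is None:
        pending[t] = cycle + MEM_LATENCY  # issue the next load

print(completed)                 # every thread makes steady progress
print(sum(completed), "retired in 80 cycles")
```

After warm-up the machine retires one load per cycle in aggregate (the latency is fully hidden), yet each individual thread sees only one retirement per 8 cycles, which is the "each thread is dog slow" point above.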
As an example, here's the benchmark numbers for Gforth 0.7.0 on two 2005-vintage CPUs (compiled with 2006-vintage compilers):
sieve bubble matrix fib
2.114 2.665 1.494 1.912 0.7.0; UltraSparc T1 1GHz; gcc-4.0.2
0.176 0.244 0.100 0.308 0.7.0; K8 2.2Ghz (Athlon 64 X2 4400+); gcc-4.0.4
The UltraSPARC T1 has 4 threads per core (and <https://en.wikipedia.org/wiki/UltraSPARC_T1> describes it as a barrel processor), while the K8 has only one. Both are implemented in a 90nm process. Admittedly, the UltraSPARC T1 has less area/core (8 cores on 378mm^2) than the Athlon 64 X2 (2 cores on 199mm^2). But if we compute
the throughput per mm^2 when using all threads (assuming perfect
scaling for both, which is more questionable for the UltraSPARC T1),
the Athlon 64 X2 wins with 0.012 executions/(s*mm^2) (executions of
all these benchmarks) compared to 0.010 for the UltraSparc T1.
( T1) 32e 2.114e 2.665e 1.494e 1.912e f+ f+ f+ f/ 379e f/ f.
( K8) 2e 0.176e 0.244e 0.100e 0.308e f+ f+ f+ f/ 199e f/ f.
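For readers who don't keep a Forth handy, the same arithmetic in Python, with the numbers copied from the benchmark table above (32 threads on a 379 mm^2 T1 die, 2 cores on a 199 mm^2 K8 die):

```python
# Throughput per mm^2, mirroring the two Forth lines above.
t1_times = [2.114, 2.665, 1.494, 1.912]  # Gforth benchmark seconds, UltraSPARC T1
k8_times = [0.176, 0.244, 0.100, 0.308]  # same benchmarks, K8

t1 = 32 / sum(t1_times) / 379  # executions/(s*mm^2), all 32 threads
k8 = 2 / sum(k8_times) / 199   # executions/(s*mm^2), both cores

print(round(t1, 3), round(k8, 3))  # 0.01 0.012
```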
And of course, when you have less than 32 threads, things look even
worse for the T1. When you have only one thread, it's more than 10
times slower.
And these are both register-machine architectures. There is a reason
why barrel processors have not taken off for CPUs.
I don't know what other people do with processors, but my designs typically need to have multiple events monitored and acted on. I don't like dealing with the potential hazards of conventional multitasking. Running independent processes on independent processors is ideal for my work. That's what a barrel processor gives me. Simple and effective.
And the customers of Sun had servers that served multiple customers simultaneously, so Sun thought something like the UltraSPARC T1 would
be simple and effective for them.
Now consider how that changes for a barrel processor.
So how does it? The TOS is now an N-way register or small RAM, just like the rest of the stack.
Which means that you need more fan-out from the ALU to the TOS's and multiplexing from the TOS's to the ALU, both of which slow down the cycle time.
Instead of asking open ended questions, why not make a statement?
I am sorry that I did not know that you needed this all spelled out.
I was expecting that my question was enough to make the additional
costs obvious.
Can you reply without the garbage at the ends of lines? What is the =20 thing?
That's the quoted-printable garbage that is coming from your Usenet
client. Some clients repair this garbage, but mine doesn't, so it
gets cited like you posted it. You can see in
It only shows up in your replies. I can't do anything about it. Can you?
It's you who complained, so why should I?
If you wanted to do something about it, you might ask what you can do
about it, so I conclude that you don't want to do anything about it.
how your last two postings have been butchered by your Usenet client.
Ok, I'll just ignore this.
This confirms my conclusion.
The registers are there in any event. The comparison is an N-way barrel processor, or N processors. Same number of registers, but in one case, much less logic.
If you don't want to incur longer cycle times, and increase the
pipelining for that purpose, you need additional registers.
If you want to run a single processor, it won't run N times faster, unless you pipeline it, adding the registers back.
For the same amount of pipelining, a single-threaded processor will
run at a higher clock rate than an N-way barrel processor,
and the barrel processor will process the instructions of a thread at a 1/N per-cycle rate, so one thread will run more than N times faster on the single-threaded processor. N threads will still run faster on it if you switch the threads rarely enough.
You need to add more pipelining (more complexity, more logic, more
area) to get a throughput benefit from barrel processing.
I would recommend that you design a stack processor in an HDL, to learn about the process, and more so, the nature of hardware.
Few software people have a good understanding of what it takes to make good hardware.
I think that is very deeply true. Very very different mindsets.
Okay, no one was too excited about my previous proposal, so I am proposing something else.
For my master's thesis at Silesian University of Technology, I am now considering building an 8 core Forth CPU for real time control. Hard real time control on a single cpu is difficult, much better to allocate one cpu to controlling each signal.
This would be a bit like the Propeller Parallax 2, and a bit different. That device uses register machines, this would be based on stack machines.
That device emulates Forth. This would have native Forth instructions.
In that device, each core has
512 longs of dual-port register RAM for code and fast variables; pairwise Parallax cores can access some of their neighbors' registers.
512 longs of dual-port lookup RAM for code, streamer lookup, and variables;
Access to (1M?) hub RAM every 8 clock cycles.
I would like to make it 8 proper Forth CPUs rather than register machines. Rather than a big central hub memory, I would like each core to have more memory. How much? With two-port memories, they could each share 2 * (1/8)th of the memory, i.e. 1/4 of the total memory. Not bad.
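Spelling that arithmetic out (under the assumption that each RAM bank is dual-ported, so two of the 8 cores can reach any given bank):

```python
from fractions import Fraction

cores = 8
ports = 2  # dual-port RAM: two cores can share each bank

# Each core gets one port's worth of access on each of the banks it
# touches, so its share of the total memory is ports/cores.
share = Fraction(ports, cores)
print(share)  # 1/4
```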
I would like communication between adjacent cores. I like how the GA144 allows neighboring cores to communicate. That seems important to me.
I wonder if this would be of interest to anyone?
There is a good chance that I would do this in cooperation with the AI and Robotics guys and their CORE 1 CPU. Ting's ep16/24/32 are also interesting. There are a bunch of other cores I need to evaluate as well. Everyone speaks well of the J1.
In other news, school is going well. I am really impressed with the education here. As a software developer, I completely misunderstood how to write Verilog. If I had tried it, it would have been a disaster. One software developer famously used nested Verilog while loops to generate a slow clock pulse. I strongly advise any developer considering designing a chip to get educated in digital design first.
Alternatively, you can do something like use the Intel design tools to lay out components and their connectivity. I am sure that there are other such tools out there. But starting with Verilog, or even VHDL, as a software developer is bound to lead to an endless stream of problems.
Warm Regards
Christopher Lozinski
Don't worry too much about him. Look up my threads for ideas.