• Re: Tonights Tradeoff

    From MitchAlsup1@21:1/5 to Robert Finch on Sat Sep 7 14:41:14 2024
    On Sat, 7 Sep 2024 2:27:40 +0000, Robert Finch wrote:

    Making the scalar register file a subset of the vector register file.
    And renaming only vector elements.

    There are eight elements in a vector register and each element is
    128-bits wide. (Corresponding to the size of a GPR). Vector register
    file elements are subject to register renaming to allow the full power
    of the OoO machine to be used to process vectors. The issue is that with
    both the vector and scalar registers present for renaming there are a
    lot of registers to rename. It is desirable to keep the number of
    renamed registers (including vector elements) <= 256 total. So, the 64
    scalar registers are aliased with the first eight vector registers.
    Leaving only 24 truly available vector registers. Hm. There are 1024
    physical registers, so maybe going up to about 300 renamable registers
    would not hurt.
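    The aliasing arithmetic above can be sketched as follows (a hypothetical
    illustration; the register counts come from the post, the names are invented):

```python
# Sketch of the aliasing described above: 64 scalar registers overlay
# the first 8 of 32 architectural vector registers, each vector register
# holding 8 elements of 128 bits. All identifiers are illustrative.
NUM_VREGS = 32   # architectural vector registers
ELEMS = 8        # 128-bit elements per vector register
NUM_SCALAR = 64  # scalar registers aliased onto the first 8 vregs

def scalar_alias(r):
    """Scalar register r maps to (vector register, element)."""
    assert 0 <= r < NUM_SCALAR
    return (r // ELEMS, r % ELEMS)

# Renaming works on elements, so the renamable-name budget is:
renamable_names = NUM_VREGS * ELEMS           # 256, the stated target
free_vregs = NUM_VREGS - NUM_SCALAR // ELEMS  # 24 vregs not aliased
```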

    Why do you think a vector register file is the way to go ??

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Robert Finch on Sun Sep 8 18:06:48 2024
    On Sun, 8 Sep 2024 3:22:55 +0000, Robert Finch wrote:

    On 2024-09-07 10:41 a.m., MitchAlsup1 wrote:
    On Sat, 7 Sep 2024 2:27:40 +0000, Robert Finch wrote:

    Making the scalar register file a subset of the vector register file.
    And renaming only vector elements.

    There are eight elements in a vector register and each element is
    128-bits wide. (Corresponding to the size of a GPR). Vector register
    file elements are subject to register renaming to allow the full power
    of the OoO machine to be used to process vectors. The issue is that with
    both the vector and scalar registers present for renaming there are a
    lot of registers to rename. It is desirable to keep the number of
    renamed registers (including vector elements) <= 256 total. So, the 64
    scalar registers are aliased with the first eight vector registers.
    Leaving only 24 truly available vector registers. Hm. There are 1024
    physical registers, so maybe going up to about 300 renamable register
    would not hurt.

    Why do you think a vector register file is the way to go ??

    I think vector use is somewhat dubious, but it has some uses. In many
    cases data can be processed just fine without vector registers. In the
    current project vector instructions use the scalar functional units to
    compute, making them no faster than scalar calculations. But vectors
    offer good code density where parallel computation on multiple data
    items with a single instruction is desirable. I do not know why people
    use vector registers in general, but they are present in some modern
    architectures.

    There is no doubt that much code can utilize vector arrangements, and
    that a processor should be very efficient in performing these work
    loads.

    The problem I see is that CRAY-like vectors vectorize instructions
    instead of vectorizing loops. Any kind of flow control within the
    loop becomes tedious at best.

    On the other hand, the Virtual Vector Method vectorizes loops and
    can be implemented such that it performs as well as CRAY-like
    vector machines without the overhead of a vector register file.
    In actuality there are only 6-bits of HW flip-flops governing
    VVM--compared to 4 KBytes for CRAY-1.

    Qupls vector registers are 512 bits wide (8 64-bit elements). Bigfoot’s vector registers are 1024 bits wide (8 128-bit elements).

    When properly abstracted, one can dedicate as many or as few HW
    flip-flops as staging buffers for vector work loads to suit
    the implementation at hand. A GBOoO may utilize that 4 KB
    file of CRAY-1 while the little low-power core uses 3 cache lines.
    Both run the same ASM code and both are efficient in their own
    sense of "efficient".

    So, instead of having ~500 vector instructions and ~1000 SIMD
    instructions one has 2 instructions and a medium scale state
    machine.
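    As a software analogy of that contrast (purely illustrative; the lane
    count and function names are invented), a VVM-style implementation binds
    vectorization to the loop rather than to a register file, so the same
    code handles any trip count on any lane width:

```python
# Illustrative analogy only: VVM vectorizes the loop itself, so an
# implementation may retire 1, 2, or 4 iterations per "beat" without
# changing the code. Think of the body as the region bracketed by the
# two instructions (VEC ... LOOP) mentioned above.
def vvm_style_add(a, b, lanes=4):
    out = [0] * len(a)
    i = 0
    while i < len(a):                    # the LOOP instruction's backedge
        for j in range(i, min(i + lanes, len(a))):  # one beat of `lanes` iterations
            out[j] = a[j] + b[j]
        i += lanes
    return out
```

    A 1-lane and a 4-lane machine produce identical results from the same
    loop, which is the portability point being made.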

    One use I am considering is the graphics transform function for doing
    rotates and translates of pixels. It uses a 3x4 matrix. ATM this is done
    with specially dedicated registers, but the matrix could be fit into a
    vector register and the transform function applied with it. Another use
    is neural net instructions.


    I added a fixed length vector type to the compiler to make it easier to experiment with vectors.

    The processor handles vector instructions by replicating them one to
    eight times depending on the vector length. It then fixes up the
    register spec fields with incrementing register numbers for each
    instruction. They get fed into the remainder of the CPU as a series of
    scalar instructions.
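    That replication step can be sketched like this (hypothetical encoding;
    the tuple fields are invented for illustration):

```python
# Sketch of expanding one vector instruction into a run of scalar
# micro-ops, bumping each register-spec field per element, as the
# paragraph above describes. The (opcode, rd, rs1, rs2) encoding is
# invented for illustration.
def expand_vector_op(opcode, rd, rs1, rs2, vlen):
    assert 1 <= vlen <= 8  # vectors hold up to eight elements
    return [(opcode, rd + i, rs1 + i, rs2 + i) for i in range(vlen)]
```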

  • From Stephen Fuld@21:1/5 to Robert Finch on Wed Sep 11 08:48:03 2024
    On 9/11/2024 6:54 AM, Robert Finch wrote:

    snip


    I have found that there can be a lot of registers available if they are implemented in BRAMs. BRAMs have lots of depth compared to LUT RAMs.
    BRAMs have a one cycle latency but that is just part of the pipeline. In
    Q+ about 40k LUTs are being used just to keep track of registers.
    (rename mappings and checkpoints).

    Given a lot of available registers I keep considering trying a VLIW
    design similar to the Itanium, rotating register and all. But I have a
    lot invested in OoO.


    Q+ has seven in-order pipeline stages before things get to the re-order buffer.

    Does each of these take a clock cycle? If so, that seems excessive.
    What is your cost for a mis-predicted branch?




    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From MitchAlsup1@21:1/5 to BGB on Wed Sep 11 21:28:51 2024
    On Tue, 10 Sep 2024 7:00:00 +0000, BGB wrote:

    On 9/9/2024 10:59 PM, Robert Finch wrote:

    Still trying to grasp the virtual vector method. Been wondering if it
    can be implemented using renamed registers.


    I haven't really understood how it could be implemented.
    But, granted, my pipeline design is relatively simplistic, and my
    priority had usually been trying to make a "fast but cheap and simple" pipeline, rather than a "clever" pipeline.

    "Good, Fast, Cheap; choose any 2" Lee Higbe 1982.

  • From MitchAlsup1@21:1/5 to Robert Finch on Wed Sep 11 21:30:31 2024
    On Tue, 10 Sep 2024 14:58:30 +0000, Robert Finch wrote:

    On 2024-09-10 3:00 a.m., BGB wrote:

    I am not as much a fan of RISC-V's 'V' extension mostly in that it would
    require essentially doubling the size of the register file.

    The register file in Q+ is huge. One of the drawbacks of supporting
    vectors. There were 1024 physical registers for support. Reduced it to
    512 and that still may be too many. There was a 4 Kb wide mapping RAM,
    resulting in a warning message. I may have to split up components into
    multiple copies to get the desired size to work.

    VVM supports vectors as big as the implementation can handle with a
    total cost of 6-bits of state.

  • From MitchAlsup1@21:1/5 to Stephen Fuld on Wed Sep 11 21:32:34 2024
    On Wed, 11 Sep 2024 15:48:03 +0000, Stephen Fuld wrote:

    On 9/11/2024 6:54 AM, Robert Finch wrote:

    snip


    I have found that there can be a lot of registers available if they are
    implemented in BRAMs. BRAMs have lots of depth compared to LUT RAMs.
    BRAMs have a one cycle latency but that is just part of the pipeline. In
    Q+ about 40k LUTs are being used just to keep track of registers.
    (rename mappings and checkpoints).

    Given a lot of available registers I keep considering trying a VLIW
    design similar to the Itanium, rotating register and all. But I have a
    lot invested in OoO.


    Q+ has seven in-order pipeline stages before things get to the
    re-order buffer.

    So does the RISC-V BOOM.

    Does each of these take a clock cycle? If so, that seems excessive.
    What is your cost for a mis-predicted branch?

    I have My 66000 decoder at 4 stages (stage 4 does rename of up to 6
    instructions) with the first 3 fetching and parsing instructions
    {along with predicting flow control}.




  • From MitchAlsup1@21:1/5 to Robert Finch on Wed Sep 11 21:27:21 2024
    On Tue, 10 Sep 2024 3:59:05 +0000, Robert Finch wrote:

    On 2024-09-08 2:06 p.m., MitchAlsup1 wrote:
    On Sun, 8 Sep 2024 3:22:55 +0000, Robert Finch wrote:

    On 2024-09-07 10:41 a.m., MitchAlsup1 wrote:
    On Sat, 7 Sep 2024 2:27:40 +0000, Robert Finch wrote:

    Making the scalar register file a subset of the vector register file.
    And renaming only vector elements.

    There are eight elements in a vector register and each element is
    128-bits wide. (Corresponding to the size of a GPR). Vector register
    file elements are subject to register renaming to allow the full power
    of the OoO machine to be used to process vectors. The issue is that with
    both the vector and scalar registers present for renaming there are a
    lot of registers to rename. It is desirable to keep the number of
    renamed registers (including vector elements) <= 256 total. So, the 64
    scalar registers are aliased with the first eight vector registers.
    Leaving only 24 truly available vector registers. Hm. There are 1024
    physical registers, so maybe going up to about 300 renamable registers
    would not hurt.

    Why do you think a vector register file is the way to go ??

    I think vector use is somewhat dubious, but it has some uses. In many
    cases data can be processed just fine without vector registers. In the
    current project vector instructions use the scalar functional units to
    compute, making them no faster than scalar calculations. But vectors
    offer good code density where parallel computation on multiple data
    items with a single instruction is desirable. I do not know why people
    use vector registers in general, but they are present in some modern
    architectures.

    There is no doubt that much code can utilize vector arrangements, and
    that a processor should be very efficient in performing these work
    loads.

    The problem I see is that CRAY-like vectors vectorize instructions
    instead of vectorizing loops. Any kind of flow control within the
    loop becomes tedious at best.

    On the other hand, the Virtual Vector Method vectorizes loops and
    can be implemented such that it performs as well as CRAY-like
    vector machines without the overhead of a vector register file.
    In actuality there are only 6-bits of HW flip-flops governing
    VVM--compared to 4 KBytes for CRAY-1.

    Qupls vector registers are 512 bits wide (8 64-bit elements). Bigfoot’s
    vector registers are 1024 bits wide (8 128-bit elements).

    When properly abstracted, one can dedicate as many or as few HW
    flip-flops as staging buffers for vector work loads to suit
    the implementation at hand. A GBOoO may utilize that 4 KB
    file of CRAY-1 while the little low-power core uses 3 cache lines.
    Both run the same ASM code and both are efficient in their own
    sense of "efficient".

    So, instead of having ~500 vector instructions and ~1000 SIMD
    instructions one has 2 instructions and a medium scale state
    machine.



    Still trying to grasp the virtual vector method. Been wondering if it
    can be implemented using renamed registers.

    Think of VVM as a set (8) of staging flip-flops taking data (line) from
    L1 and feeding it into 4-wide ALUs, then back into another set (4) of
    flip-flops which deliver data to L1; with wide muxes to get the LD data
    aligned with the SLU and the ALU result aligned back to L1.

    Then support this infrastructure with a reservation station-like queue
    which can advance (1,2,4) iterations per clock.

    The registers named in the asm are named into the staging flip-flops
    {like renaming} and the whole thing optimized for multi-lane execution
    with 6-bits of total overhead.

    Qupls has RISC-V style vector / SIMD registers. For Q+ every instruction
    can be a vector instruction, as there are bits indicating which
    registers are vector registers in the instruction. All the scalar instructions become vector. This cuts down on some of the bloat in the
    ISA. There is only a handful of vector specific instructions (about
    eight I think). The drawback is that the ISA is 48-bits wide. However,
    the code bloat is less than 50% as some instructions have
    dual-operations. Branches can increment or decrement and loop. Bigfoot
    uses a postfix word to indicate to use the vector form of the
    instruction. Bigfoot’s code density is a lot better being variable
    length, but I suspect it will not run as fast. Bigfoot and Q+ share a
    lot of the same code. Trying to make the guts of the cores generic.

    Too bad...

  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Thu Sep 12 05:37:36 2024
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Tue, 10 Sep 2024 7:00:00 +0000, BGB wrote:

    On 9/9/2024 10:59 PM, Robert Finch wrote:

    Still trying to grasp the virtual vector method. Been wondering if it
    can be implemented using renamed registers.


    I haven't really understood how it could be implemented.
    But, granted, my pipeline design is relatively simplistic, and my
    priority had usually been trying to make a "fast but cheap and simple"
    pipeline, rather than a "clever" pipeline.

    "Good, Fast, Cheap; choose any 2" Lee Higbe 1982.

    Still better than "Good, Fast, Cheap: Choose one."

  • From MitchAlsup1@21:1/5 to Robert Finch on Thu Sep 12 16:46:04 2024
    On Thu, 12 Sep 2024 3:37:22 +0000, Robert Finch wrote:

    On 2024-09-11 11:48 a.m., Stephen Fuld wrote:
    On 9/11/2024 6:54 AM, Robert Finch wrote:

    snip


    I have found that there can be a lot of registers available if they
    are implemented in BRAMs. BRAMs have lots of depth compared to LUT
    RAMs. BRAMs have a one cycle latency but that is just part of the
    pipeline. In Q+ about 40k LUTs are being used just to keep track of
    registers. (rename mappings and checkpoints).

    Given a lot of available registers I keep considering trying a VLIW
    design similar to the Itanium, rotating register and all. But I have a
    lot invested in OoO.


    Q+ has seven in-order pipeline stages before things get to the
    re-order buffer.

    Does each of these take a clock cycle?  If so, that seems excessive.
    What is your cost for a mis-predicted branch?




    Each stage takes one clock cycle. Unconditional branches are detected at
    the second stage and taken then so they do not consume as many clocks.
    There are two extra stages to handle vector instructions. Those two
    stages could be removed if vectors are not needed.

    Mis-predicted branches are really expensive. They take about six clocks,
    plus the seven clocks to refill the pipeline, so it is about 13 clocks.
    Seems like it should be possible to reduce the number of clocks of
    processing during the miss, but I have not got around to it yet. There
    is a branch miss state machine that restores the checkpoint. Branches
    need a lot of work yet.

    In a machine I did in 1990-2 we would fetch down the alternate path
    and put the recovery instructions in a buffer, so when a branch was mispredicted, the instructions were already present.

    So, you can't help the 6 cycles of branch verification latency,
    but you can fix the pipeline refill latency.

    We got 2.05 i/c on XLISP SPECint 89 mostly because of the low backup
    overhead.
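    The penalty arithmetic being discussed can be put in a toy model (the
    6-cycle verification latency and 7-stage front end come from the posts
    above; the assumption that an alternate-path buffer hides the entire
    refill is the idealized best case):

```python
# Toy cost model for the mispredict discussion above. The figures are
# from the thread; "alt_path_buffered" models the 1990-2 machine's
# alternate-path buffer fully hiding the pipeline refill, which is an
# idealizing assumption.
def mispredict_penalty(verify_cycles=6, frontend_stages=7,
                       alt_path_buffered=False):
    refill = 0 if alt_path_buffered else frontend_stages
    return verify_cycles + refill
```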

    I am not sure how well the branch prediction works. Instruction runs in
    SIM are not long enough yet. Something in the AGEN/TLB/LSQ is not
    working correctly yet, leading to bad memory cycles.

  • From MitchAlsup1@21:1/5 to Robert Finch on Thu Sep 12 20:46:35 2024
    On Thu, 12 Sep 2024 19:28:19 +0000, Robert Finch wrote:

    On 2024-09-12 12:46 p.m., MitchAlsup1 wrote:
    On Thu, 12 Sep 2024 3:37:22 +0000, Robert Finch wrote:

    On 2024-09-11 11:48 a.m., Stephen Fuld wrote:
    On 9/11/2024 6:54 AM, Robert Finch wrote:

    snip


    I have found that there can be a lot of registers available if they
    are implemented in BRAMs. BRAMs have lots of depth compared to LUT
    RAMs. BRAMs have a one cycle latency but that is just part of the
    pipeline. In Q+ about 40k LUTs are being used just to keep track of
    registers. (rename mappings and checkpoints).

    Given a lot of available registers I keep considering trying a VLIW
    design similar to the Itanium, rotating register and all. But I have a
    lot invested in OoO.


    Q+ has seven in-order pipeline stages before things get to the
    re-order buffer.

    Does each of these take a clock cycle?  If so, that seems excessive.
    What is your cost for a mis-predicted branch?




    Each stage takes one clock cycle. Unconditional branches are detected at
    the second stage and taken then so they do not consume as many clocks.
    There are two extra stages to handle vector instructions. Those two
    stages could be removed if vectors are not needed.

    Mis-predicted branches are really expensive. They take about six clocks,
    plus the seven clocks to refill the pipeline, so it is about 13 clocks.
    Seems like it should be possible to reduce the number of clocks of
    processing during the miss, but I have not got around to it yet. There
    is a branch miss state machine that restores the checkpoint. Branches
    need a lot of work yet.

    In a machine I did in 1990-2 we would fetch down the alternate path
    and put the recovery instructions in a buffer, so when a branch was
    mispredicted, the instructions were already present.

    So, you can't help the 6 cycles of branch verification latency,
    but you can fix the pipeline refill latency.

    We got 2.05 i/c on XLISP SPECint 89 mostly because of the low backup
    overhead.


    That sounds like a good idea. The fetch typically idles for a few cycles
    as it can fetch more instructions than can be consumed in a single
    cycle. So, while it’s idling it could be fetching down an alternate
    path. Part of the pipeline would need to be replicated, doubling up on
    the size. Then an A/B switch happens which selects the right pipeline.

    You want the alternate path buffer to be staged up ready to go. You
    do not necessarily have to dedicate any post decode pipeline stages to
    them. You can fetch these from the buffer indexed by branch number,
    so when the branch fires to execute the verify stuff, you are fetching
    the backup instructions.

    You CAN use the renamer state after the previously issued group so the
    buffer contains already renamed registers; if you back up and use these
    instructions you threw away the post-issue parts of the renamer anyway.
    This enables you to take a cycle to back up the renamer without penalty.

    Would not want to queue to the reorder buffer from the alternate path,
    as there is a bit of a bottleneck at the queue. Now wondering what to do
    about multiple branches. Multiple pipelines and more switches? The
    front-end would look like a pipeline tree to handle multiple
    outstanding branches.

    Was wondering what to do with the extra fetch bandwidth. Fetching two
    cache-lines at once means there may have been up to 21 instructions
    fetched. But it's only a four-wide machine.

    For my 6-wide machine I am fetching 1/2 a cache line twice for the
    sequential path and 1/2 a cache line for the alternate path from
    an 8 banked ICache.

    I was going to try to feed multiple cores from the same cache. Core A is
    performance, core B is average, and core C is economy, using left-over
    bandwidth from A and B.

    ARM's big.LITTLE strategy, with some power philosophy, gives you that
    average as a little core running at high voltage and frequency.

    I can code the alternate path fetch up and try it in SIM, but it is too
    large for my FPGA right now. Another config option. Might put the switch before the rename stage. Nothing like squeezing a mega-LUT design into
    100k LUTs. Getting a feel for the size of things. A two-wide in-order
    core would easily fit. Even a simple two-wide out-of-order core would
    likely fit, if one stuck to 32-bits and a RISC instruction set. A
    four-wide OoO core with lots of features is pushing it.

  • From EricP@21:1/5 to All on Fri Sep 13 11:08:42 2024
    MitchAlsup1 wrote:
    On Thu, 12 Sep 2024 19:28:19 +0000, Robert Finch wrote:

    Would not want to queue to the reorder buffer from the alternate path,
    as there is a bit of a bottleneck at the queue. Now wondering what to do
    about multiple branches. Multiple pipelines and more switches? Front-end
    would look like a pipeline tree to handle multiple outstanding branches.

    Was wondering what to do with the extra fetch bandwidth. Fetching two
    cache-lines at once means there may have been up to 21 instructions
    fetched. But it's only a four-wide machine.

    For my 6-wide machine I am fetching 1/2 a cache line twice for the
    sequential path and 1/2 a cache line for the alternate path from
    an 8 banked ICache.

    Why 8 banks if you are fetching just three 32-byte buffers at once?
    I suppose 8 minimizes the chance of colliding on a bank access.
    Still, it seems like 4 banks would be sufficient.
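    The collision intuition can be checked directly (assuming fetch
    addresses land on independent, uniformly random banks, which real
    strides will not quite satisfy):

```python
from math import perm

# Probability that `fetches` independent, uniformly random bank indices
# are all distinct -- a rough proxy for avoiding a bank conflict. The
# uniform-random assumption is an idealization of real fetch patterns.
def p_no_conflict(fetches=3, banks=8):
    return perm(banks, fetches) / banks ** fetches
```

    Under this model three concurrent fetches avoid conflict about 66% of
    the time with 8 banks versus about 38% with 4, which is one argument
    for the larger bank count.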

  • From MitchAlsup1@21:1/5 to EricP on Fri Sep 13 17:09:48 2024
    On Fri, 13 Sep 2024 15:08:42 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Thu, 12 Sep 2024 19:28:19 +0000, Robert Finch wrote:

    Would not want to queue to the reorder buffer from the alternate path,
    as there is a bit of a bottleneck at the queue. Now wondering what to do
    about multiple branches. Multiple pipelines and more switches? Front-end
    would look like a pipeline tree to handle multiple outstanding branches.
    Was wondering what to do with the extra fetch bandwidth. Fetching two
    cache-lines at once means there may have been up to 21 instructions
    fetched. But its only a four-wide machine.

    For my 6-wide machine I am fetching 1/2 a cache line twice for the
    sequential path and 1/2 a cache line for the alternate path from
    an 8 banked ICache.

    Why 8 banks if you are fetching just three 32-byte buffers at once?
    I suppose 8 minimizes the chance of colliding on a bank access.
    Still, it seems like 4 banks would be sufficient.

    3 banks for the predicted fetch stuff, 1-2 banks for the mispredicted
    fetches.

    You not only fetch instructions down the predicted direction, you also
    fetch instructions down the predicted non-taken direction so they are
    ready for insertion should that branch need backup.
