Showing content from https://groups.google.com/g/comp.sys.arm/c/KDFGkeBRTPk/m/G5CMf41k-0QJ below:

StrongARM multiply speed

Jan Vlietinck

Dec 9, 1996, 9:00:00 AM

I've done some tests to measure multiply speed on the StrongARM.
It seems that the processor stalls until the multiply has
finished, so no following instructions execute in parallel with the
multiply, contrary to what the data sheets specify. Can anyone confirm this?

Jan Vlietinck.

Vincent Lefevre

Dec 9, 1996, 9:00:00 AM

I've just done some tests too. The processor does not stall if there's
no dependency with the following instruction. With large arguments,
I obtain: 3 cycles + 1 delay cycle (as said in the other thread about
StrongARM multiply -- the data sheets are wrong).

But if you have the following sequence:
MUL R0, R1, R2
NOP
the processor stalls!!! as NOP is assembled to MOV R0, R0. Maybe the
assemblers should check for such problems and assemble to MOV R1, R1
in such cases... for "symmetry" reasons. :)
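
A minimal sketch of the dependency described above, written in Python as a toy model. The tuple format and the helper name `depends_on` are illustrative assumptions; the real StrongARM interlock logic is certainly more involved than a register-name comparison.

```python
# Toy model: an instruction is (mnemonic, destination, sources).
# Assumption: a stall is modelled as "the next instruction reads or writes
# the register the multiply writes". This is a simplification.

def depends_on(prev, nxt):
    """Return True if nxt reads or writes the register that prev writes."""
    _, prev_dest, _ = prev
    _, nxt_dest, nxt_srcs = nxt
    return prev_dest == nxt_dest or prev_dest in nxt_srcs

mul = ("MUL", "R0", ("R1", "R2"))
nop_r0 = ("MOV", "R0", ("R0",))   # what NOP is assembled to
nop_r1 = ("MOV", "R1", ("R1",))   # the alternative suggested above

print(depends_on(mul, nop_r0))  # True: MOV R0, R0 touches the MUL result
print(depends_on(mul, nop_r1))  # False: MOV R1, R1 does not
```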

--
Vincent Lefevre, vlef...@ens-lyon.fr | Acorn Risc PC, StrongARM @ 202MHz
http://www.ens-lyon.fr/~vlefevre | 20+1MB RAM, Eagle M2, TV + Teletext
PhD in Computer Science, 1st year | Apple CD-300, SyQuest 270MB (SCSI)
-----------------------------------------------------------------------------

Michael Williams

Dec 10, 1996, 9:00:00 AM

In article <58i3kl$m...@cri.ens-lyon.fr> Vincent Lefevre <vlef...@ens-lyon.fr> wrote:

>But if you have the following sequence:
> MUL R0, R1, R2
> NOP
>the processor stalls!!! as NOP is assembled to MOV R0, R0. Maybe the
>assemblers should check for such problems and assemble to MOV R1, R1
>in such cases... for "symmetry" reasons. :)

Surely an assembler that optimises NOP would be one that removes it
entirely?

Mike.

Wilco Dijkstra

Dec 11, 1996, 9:00:00 AM

Michael Williams wrote:

> In article <58i3kl$m...@cri.ens-lyon.fr> Vincent Lefevre <vlef...@ens-lyon.fr> wrote:
> >But if you have the following sequence:
> > MUL R0, R1, R2
> > NOP
> >the processor stalls!!! as NOP is assembled to MOV R0, R0. Maybe the
> >assemblers should check for such problems and assemble to MOV R1, R1
> >in such cases... for "symmetry" reasons. :)

You can't blame x-year old assemblers for not being SA-aware...

I've never liked the MOV R0, R0 NOOP. Better would be to have a real
NOOP in ARM hardware. But maybe something like CMP pc, pc with the
S bit unset would be OK (if this bit pattern isn't already in use).
This doesn't introduce any data dependencies on the SA. Checking for
data dependencies is not worth it: they will vary from processor to
processor. Eg. the ARM8 seems to have instructions with a 2 cycle
result delay. This clearly means looking backwards 2 instructions,
including possible branch origins... :-(

> Surely an assembler that optimises NOP would be one that removes it
> entirely?

Yeah, that's exactly the problem I have... :-)

My assembler removes all dead code, so the MOV R0,R0 just disappears...
This is clearly *not* desirable, as there is a need for a NOOP when you
switch between processor modes. But because people are used to writing
MOV Rx, Rx this special case has to be recognized and converted to a real
NOOP... (sigh)

Wilco (opinions my own)
--
-----------------------------------------------------------------------
Wilco Dijkstra Advanced RISC Machines Ltd
Software Engineer Fulbourn Road
Wilco.D...@armltd.co.uk Cherry Hinton, CB1 4JN
Phone: +44 1223 400 518 Cambridge, UK

Owen Smith

Dec 12, 1996, 9:00:00 AM

Wilco Dijkstra <Wilco.D...@armltd.co.uk> wrote:

>I've never liked the MOV R0, R0 NOOP. Better would be to have a real
>NOOP in ARM hardware. But maybe something like CMP pc, pc with the
>S bit unset would be OK (if this bit pattern isn't already in use).

CMP without the S bit set is the encoding for MRS and MSR. TEQ without the S
bit set is unused, but we can't switch to using that for NOP because the
StrongARM faults it as an undefined instruction. What about LDM with no
registers to load and no base register write-back? That doesn't do very much,
but does it work on older CPUs? Also it may cause some sort of phantom memory
access and hence take an N or S cycle rather than an I cycle on a cached
processor.

The recommended NOP a long time ago (ARM2 and ARM3 days) was MOVNV r0, r0.
Due to the Never condition code this would allow processors to know that
there are no relevant dependencies, though this is effectively adding special
casing to the otherwise generic conditional execution mechanism. The NV
condition code is now reserved for future expansion and hence the switch to
MOV r0, r0.

I've also seen ANDEQ r0, r0, r0 used as a NOP, first of all because it is one and
secondly because the encoding as a word is zero. Personally I'm not keen on
it because zeros are often data; it's a three register instruction (might be
slower on future CPUs) and it's a conditional instruction (again might be
slower if the previous instruction is a delayed result flag setting
instruction).
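
The three NOP candidates mentioned above can be compared by encoding them. The sketch below uses the standard ARM data-processing field layout for a register operand with no shift; the helper name `encode_dp` and the small lookup tables are illustrative only and cover just the cases discussed here.

```python
# Data-processing layout (register operand, no shift):
# cond[31:28] 00 I[25] opcode[24:21] S[20] Rn[19:16] Rd[15:12] Rm[3:0]
COND = {"EQ": 0x0, "AL": 0xE, "NV": 0xF}
OPCODE = {"AND": 0x0, "MOV": 0xD}

def encode_dp(op, cond, rd, rn, rm, s=0):
    """Encode one data-processing instruction (illustrative helper)."""
    return ((COND[cond] << 28) | (OPCODE[op] << 21) | (s << 20)
            | (rn << 16) | (rd << 12) | rm)

mov_nop = encode_dp("MOV", "AL", rd=0, rn=0, rm=0)    # MOV r0, r0
movnv_nop = encode_dp("MOV", "NV", rd=0, rn=0, rm=0)  # MOVNV r0, r0
andeq_nop = encode_dp("AND", "EQ", rd=0, rn=0, rm=0)  # ANDEQ r0, r0, r0

for name, word in [("MOV r0, r0", mov_nop),
                   ("MOVNV r0, r0", movnv_nop),
                   ("ANDEQ r0, r0, r0", andeq_nop)]:
    print(f"{name:18s} 0x{word:08X}")
```

The all-zero word for ANDEQ r0, r0, r0 is what makes it indistinguishable from zeroed data.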

As Wilco points out, using a normal instruction for NOP really isn't optimal.

By the way, what's the encoding for STRH and LDRH and the new longer
multiplies? Perhaps STRH and LDRH are encoded using the NV condition code
since I believe you can't put condition codes on STRH and LDRH. LDR and STR
with the NV condition perhaps? That would prevent them from doing anything on
older CPUs that don't support half word access - though this is a debatable
bonus.

--
Owen (owen....@net-tel.co.uk)
If you choose to attribute my words to NET-TEL that's your problem.........77

Wilco Dijkstra

Dec 13, 1996, 9:00:00 AM

Owen Smith wrote:
> >I've never liked the MOV R0, R0 NOOP. Better would be to have a real
> >NOOP in ARM hardware. But maybe something like CMP pc, pc with the
> >S bit unset would be OK (if this bit pattern isn't already in use).
>
> CMP without the S bit set is the encoding for MRS and MSR.

Well, actually only _some_ CMP instructions overlap with MRS/MSR...
CMP PC, R0 without the S-bit is actually MRS R0, SPSR, but CMP PC, PC
without the S-bit doesn't overlap any other instruction as far as I
can determine.
This doesn't mean it'll work: this depends on how instructions are
categorized by the hardware: maybe it thinks the CMP is a special
kind of MRS...
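
The overlap can be checked at the bit level. This sketch compares the CMP bit pattern with the S bit cleared against the MRS Rd, SPSR pattern; the helper names are illustrative, and the check covers only this one overlap, not the whole instruction set. It says nothing about how a particular core actually decodes the word.

```python
PC = 15

def encode_cmp_no_s(rn, rm, cond=0xE):
    """CMP Rn, Rm with S cleared: opcode 1010, Rd field left at zero."""
    return (cond << 28) | (0xA << 21) | (rn << 16) | rm

def encode_mrs_spsr(rd, cond=0xE):
    """MRS Rd, SPSR: cond 00010 1 00 1111 Rd 000000000000."""
    return (cond << 28) | (0b00010 << 23) | (1 << 22) | (0xF << 16) | (rd << 12)

cmp_pc_r0 = encode_cmp_no_s(PC, 0)
cmp_pc_pc = encode_cmp_no_s(PC, PC)
all_mrs_spsr = {encode_mrs_spsr(rd) for rd in range(16)}

print(f"CMP PC, R0 (no S) = 0x{cmp_pc_r0:08X}")
print(f"MRS R0, SPSR      = 0x{encode_mrs_spsr(0):08X}")
print(f"CMP PC, PC (no S) = 0x{cmp_pc_pc:08X}")
print("CMP PC, PC matches an MRS Rd, SPSR word:", cmp_pc_pc in all_mrs_spsr)
```

CMP PC, PC differs from every MRS Rd, SPSR word only in the low four bits, which is why a decoder that ignores those bits could still treat it as an MRS.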

> What about LDM with no
> registers to load and no base register write-back? That doesn't do very much,
> but does it work on older CPUs? Also it may cause some sort of phantom memory
> access and hence take an N or S cycle rather than an I cycle on a cached
> processor.

Yes, I suppose the hardware will just try to start an LDM. I guess that
it will always transfer a register, so you might end up loading random
data in a random register, far worse than an extra memory access...

> The recommended NOP a long time ago (ARM2 and ARM3 days) was MOVNV r0, r0.
> Due to the Never condition code this would allow processors to know that
> there are no relevant dependencies, though this is effectively adding special
> casing to the otherwise generic conditional execution mechanism.

Never executing an instruction is effectively a NOOP, but an expensive
NOOP: there are (were) 2^28 possible NOOPs, which is a bit too much. But
now there are 0 which is a bit too little :-)

> I've also seen ANDEQ r0, r0, r0 used as a NOP, first of all because it is and
> secondly because the encoding as a word is zero. Personally I'm not keen on
> it because zeros are often data; it's a three register instruction (might be
> slower on future CPUs) and it's a conditional instruction (again might be
> slower if the previous instruction is a delayed result flag setting
> instruction).

Quite correct, although AND isn't likely to get slower. ADDs might indeed
get a result delay, so use ORR when you merge 2 values in a register.
I don't think the ANDEQ will be slower if the previous instr sets the
condition codes. Most probably, the instr after it (whether conditionalized
or not) stalls until the flags are updated. At least this is what the SA
does when you use MULS.

> By the way, what's the encoding for STRH and LDRH and the new longer
> multiplies? Perhaps STRH and LDRH are encoded using the NV condition code
> since I believe you can't put condition codes on STRH and LDRH.

No, LDRH, STRH, LDRSB, U/SMULL, U/SMLAL all use the unused combination of
bits 25-27 cleared, and bits 4 & 7 set (like SWP).
So you can use condition codes, but for the signed byte load and halfword
operations there is only an 8 bit offset and no shift available.
Currently there are no instructions which use the NV-condition.

Michael Williams

Dec 13, 1996, 9:00:00 AM

Owen Smith <Owen....@net-tel.co.uk> wrote:

>By the way, what's the encoding for STRH and LDRH and the new longer
>multiplies?

I'm sure I've posted this recently before...

LDRH et al:

<cond> 000 P U 1 W L <Rn> <Rd> HiOff 1 S H 1 LoOff
<cond> 000 P U 0 W L <Rn> <Rd> 0000 1 S H 1 <Rm>

P => pre/post index
U => up/down
W => writeback (if P==1), should be zero if P=0
L => load/store
S => signed (SBZ for L=0, SBO for H=0)
H => halfword/byte

N.B. SH = %00 (unsigned byte) encodes SWP/SWPB, multiply etc.
HiOff + LoOff form an 8-bit immediate offset
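
As a sketch of the immediate-offset form above, the following Python helper packs the fields into a 32-bit word. The function name and keyword arguments are illustrative; it assumes L=1 means load, U=1 means add the offset, and that signed transfers exist only as loads.

```python
def encode_halfword_imm(load, rd, rn, offset=0, signed=False, halfword=True,
                        pre=True, writeback=False, cond=0xE):
    """Encode LDRH/STRH/LDRSB/LDRSH with an 8-bit immediate offset."""
    if not signed and not halfword:
        raise ValueError("SH = 00 is the SWP/multiply encoding space")
    if signed and not load:
        raise ValueError("signed transfers are assumed to be loads only")
    magnitude = abs(offset)
    if magnitude > 0xFF:
        raise ValueError("offset must fit in 8 bits")
    return ((cond << 28)
            | (int(pre) << 24)            # P
            | (int(offset >= 0) << 23)    # U
            | (1 << 22)                   # immediate form
            | (int(writeback) << 21)      # W
            | (int(load) << 20)           # L
            | (rn << 16) | (rd << 12)
            | ((magnitude >> 4) << 8)     # HiOff
            | (1 << 7) | (int(signed) << 6) | (int(halfword) << 5) | (1 << 4)
            | (magnitude & 0xF))          # LoOff

ldrh = encode_halfword_imm(load=True, rd=0, rn=1)              # LDRH r0, [r1]
strh = encode_halfword_imm(load=False, rd=0, rn=1, offset=2)   # STRH r0, [r1, #2]
ldrsb = encode_halfword_imm(load=True, rd=2, rn=3, offset=-0x12,
                            signed=True, halfword=False)       # LDRSB r2, [r3, #-0x12]

for word in (ldrh, strh, ldrsb):
    # Bits 27-25 are clear and bits 7 and 4 are set in every case.
    print(f"0x{word:08X}", (word >> 25) & 0b111, hex(word & 0x90))
```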

multiplies:

U => unsigned
A => accumulate
S => set flags

Mike.

Torben AEgidius Mogensen

Dec 19, 1996, 9:00:00 AM

Wilco Dijkstra <Wilco.D...@armltd.co.uk> writes:

>Owen Smith wrote:

>> The recommended NOP a long time ago (ARM2 and ARM3 days) was MOVNV r0, r0.
>> Due to the Never condition code this would allow processors to know that
>> there are no relevant dependencies, though this is effectively adding special
>> casing to the otherwise generic conditional execution mechanism.

>Never executing an instruction is effectively a NOOP, but an expensive
>NOOP: there are (were) 2^28 possible NOOPs, which is a bit too much. But
>now there are 0 which is a bit too little :-)

Reserving the NV condition code for expansion is a good idea, but only
if you actually get around to using the expansion potential. ARM has
sat on these bits for a long time without doing anything with them.

IMHO, using the NV code to signal a completely different decoding of
the remaining instruction bits is (while potentially powerful) not an
aesthetic solution. I would much prefer that the NV code still signals
conditional execution. Just adding another flag combination will not
be very useful, though, so I have another idea (which I on several
previous occasions have aired in the comp.sys.acorn groups).

The idea is to let the NV bits signal that the instruction is executed
if and only if the most recent instruction was. This would allow a
sequence of instructions that are conditional on the same code to
change the flags. A simple example is conditional addition of two 64
bit numbers. 64 bit addition requires setting the carry flag after the
addition of the least significant parts and using it for an ADC of the
most significant parts:

ADDS R4,R0,R2
ADC R5,R1,R3

Doing this conditionally (e.g. on the EQ condition) would normally
require a branch

BNE l1
ADDS R4,R0,R2
ADC R5,R1,R3
l1:

as the ADDS would change the zero flag. With the proposed meaning of
the NV condition code (which we can call PR for "previous") we can
simply do

ADDSEQ R4,R0,R2
ADCPR R5,R1,R3

In general, it would be more pleasing to use PR whenever a group of
instructions are conditional instead of just repeating the condition
code, as it makes it clearer that the instructions are grouped. It
could also potentially be more efficient, as it would be easier for
the processor to detect dependencies on the flags.

Implementation using the present pipeline structure(s) is simple: just
add an extra P-flag bit to the PSW. This flag is set whenever a
condition evaluates to true and unset whenever a condition evaluates
to false. The PR condition is then just another condition which just
tests P=1. The main difference to the other flags is that the P flag
is updated even when the S bit is not set.
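
The P-flag rule described above can be tried out in a small Python model. This is only a sketch of the proposal: it supports ADD and ADC with the EQ, NE, AL and PR conditions, keeps the state in a plain dictionary, and the function names are illustrative.

```python
MASK = 0xFFFFFFFF

def cond_true(cond, st):
    """Evaluate a condition code against the model state."""
    if cond == "AL":
        return True
    if cond == "EQ":
        return st["Z"]
    if cond == "NE":
        return not st["Z"]
    if cond == "PR":
        return st["P"]
    raise ValueError(cond)

def step(st, op, cond, rd, rn, rm, s=False):
    """Execute ADD or ADC; P is updated whether or not S is set."""
    executed = cond_true(cond, st)
    st["P"] = executed
    if not executed:
        return
    total = st[rn] + st[rm] + (int(st["C"]) if op == "ADC" else 0)
    st[rd] = total & MASK
    if s:
        st["C"] = total > MASK
        st["Z"] = st[rd] == 0

def add64_if_eq(z, lo_a, hi_a, lo_b, hi_b, second_cond="PR"):
    """Conditional 64-bit add: ADDSEQ R4,R0,R2 then ADC<second_cond> R5,R1,R3."""
    st = {"R0": lo_a, "R1": hi_a, "R2": lo_b, "R3": hi_b, "R4": 0, "R5": 0,
          "Z": z, "C": False, "P": False}
    step(st, "ADD", "EQ", "R4", "R0", "R2", s=True)
    step(st, "ADC", second_cond, "R5", "R1", "R3")
    return (st["R5"] << 32) | st["R4"]

print(hex(add64_if_eq(True, 1, 5, 2, 7)))                    # ADCPR runs
print(hex(add64_if_eq(True, 1, 5, 2, 7, second_cond="EQ")))  # ADCEQ is skipped
print(hex(add64_if_eq(False, 1, 5, 2, 7)))                   # nothing runs
```

With ADCEQ as the second instruction the high word is lost, because the ADDS has already cleared Z; with ADCPR the pair behaves as one conditional unit.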

But the PR code could also be implemented directly in the pipeline
instead of using a flag. It isn't clear anyway that it would be an
advantage to be able to set/clear the P flag using MSR and MRS. With a
direct implementation the independence from the PSW can be exploited
to obtain better parallelism in a superscalar implementation. It can
also be used for fast skipping of a whole group of conditional
instructions that won't be executed. While this is also possible by
comparing the condition code with that of the previous instruction, it
will be simpler (faster) to make a comparison with 1111.

In the description above, the status of the previous instruction is
the one used for the PR condition. You could also consider letting it
be the status of the most recent conditional instruction (i.e.
condition != AL). This is strictly more powerful (as the PR condition
will always be true after an unconditional instruction). This would
allow interleaving instructions conditional on PR with unconditional
instructions, which might be good for scheduling. Let us, for example,
assume that we conditionally want to load a register and then
increment the result. This would be e.g.

LDREQ R0, [R1]
ADDPR R0, R0, #1

If this is followed by an independent (and unconditional) instruction,
e.g.

ADDS R1, R1, R2, LSL#1

it would improve the efficiency of the program if this could be
scheduled between the two above instructions, i.e.

LDREQ R0, [R1]
ADDS R1, R1, R2, LSL#1
ADDPR R0, R0, #1

Note that even though the inserted instruction changes the flags, it
is unconditional and hence doesn't change the status of the PR
condition.
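
The difference between the two update rules shows up on the scheduled sequence above. The Python sketch below tracks only which instructions execute, not their data; the tuple format and the `only_conditional` switch are illustrative assumptions.

```python
def run(program, z, only_conditional):
    """Return [(text, executed)] for a list of (text, cond, new_z) entries.

    new_z is the Z value an executed flag-setting instruction leaves behind,
    or None if the instruction does not set flags.
    only_conditional=False: P follows every instruction.
    only_conditional=True:  P follows only instructions with cond != AL.
    """
    p = False
    trace = []
    for text, cond, new_z in program:
        if cond == "AL":
            ok = True
        elif cond == "EQ":
            ok = z
        elif cond == "PR":
            ok = p
        else:
            raise ValueError(cond)
        if cond != "AL" or not only_conditional:
            p = ok
        if ok and new_z is not None:
            z = new_z
        trace.append((text, ok))
    return trace

scheduled = [
    ("LDREQ R0, [R1]", "EQ", None),
    ("ADDS R1, R1, R2, LSL#1", "AL", False),
    ("ADDPR R0, R0, #1", "PR", None),
]

# Z clear on entry, so the LDREQ is skipped.
for rule in (False, True):
    print(rule, [ok for _, ok in run(scheduled, z=False, only_conditional=rule)])
```

When P follows every instruction, the interleaved ADDS makes the ADDPR execute even though the LDREQ was skipped; when P follows only conditional instructions, the ADDPR stays tied to the LDREQ.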

It could also be interesting to have the negation of the PR condition
(NP), but there is not room for both. With NP you could interleave the
two branches of a conditional:

ADDEQ R0,R0,#1
ORRNP R0,R0,#3
ADDNP R1,R1,R0
ANDNP R1,R1,R0

instead of

ADDEQ R0,R0,#1
ADDPR R1,R1,R0
ORRNE R0,R0,#3
ANDPR R1,R1,R0

Something similar to this is done on PA-RISC using conditional skip of
next instruction. If I should choose either PR or NP, I think I would
prefer PR, though.

Torben Mogensen (tor...@diku.dk)

Paul Clark

Dec 20, 1996, 9:00:00 AM

Torben AEgidius Mogensen wrote:
> It could also be interesting to have the negation of the PR condition
> (NP), but there is not room for both. With NP you could interleave the
> two branches of a conditional:
>
> [snip]
>
> Something similar to this is done on PA-RISC using conditional skip of
> next instruction. If I should choose either PR or NP, I think I would
> prefer PR, though.

However, NP would still give you the desired NOP effect when used after
a non-conditional instruction, so is somewhat more backwards compatible.

There is still the potential for breaking some code in abstruse cases,
though, (NV after an unsuccessful conditional?), so it might be
politically unacceptable.
--
Paul Clark mailto:p...@sysmag.com
Systems Magic Ltd. http://www.sysmag.com

Wilco Dijkstra

Dec 20, 1996, 9:00:00 AM

Torben AEgidius Mogensen wrote:
> IMHO, using the NV code to signal a completely different decoding of
> the remaining instruction bits is (while potentially powerful) not an
> aesthetic solution. I would much prefer that the NV code still signals
> conditional execution. Just adding another flag combination will not
> be very useful, though, so I have another idea (which I on several
> previous occasions have aired in the comp.sys.acorn groups).

Right, non-conditionalisable instructions would destroy the nice
orthogonality of the ARM instruction set.

> The idea is to let the NV bits signal that the instruction is executed
> if and only if the most recent instruction was. This would allow a
> sequence of instructions that are conditional on the same code to
> change the flags. A simple example is conditional addition of two 64
> bit numbers. 64 bit addition requires setting the carry flag after the
> addition of the least significant parts and use this for a ADC of the
> most significant parts:

> Doing this conditionally (e.g. on the EQ condition) would normally
> require a branch
>
> BNE l1
> ADDS R4,R0,R2
> ADC R5,R1,R3
> l1:
>
> as the ADDS would change the zero flag. With the proposed meaning of
> the NV condition code (which we can call PR for "previous") we can
> simply do
>
> ADDSEQ R4,R0,R2
> ADCPR R5,R1,R3
>

A nice idea! Effectively this means you 'lock' the flags, so there
are two sets of flags. The PowerPC, for example, has eight 4-bit
condition fields, so you can set one of the eight and then reuse it
later on.
I think that when 64 bit ints become more widely used, your idea
could make a significant difference in code size and efficiency.

> Implementation using the present pipeline structure(s) is simple: just
> add an extra P-flag bit to the PSW. This flag is set whenever a
> condition evaluates to true and unset whenever a condition evaluates
> to false. The PR condition is then just another condition which just
> tests P=1. The main difference to the other flags is that the P flag
> is updated even when the S bit is not set.

Right, this works fine, and shouldn't be hard to implement.

> But the PR code could also be implemented directly in the pipeline
> instead of using a flag. It isn't clear anyway that it would be an
> advantage to be able to set/clear the P flag using MSR and MRS.

Well, at least I'd like to have my PR instructions execute equally
well if there was an interrupt between two conditional instructions!
The pipeline state isn't saved on interrupts...
So you need the P bit in the PSR registers.

> It can
> also be used for fast skipping of a whole group of conditional
> instructions that won't be executed. While this is also possible by
> comparing the condition code with that of the previous instruction, it

It's even harder, as you also need to check whether the previous
instruction is updating the flags, or for deeper pipelines the
instruction before that...

> will be simpler (faster) to make a comparison with 1111.

Indeed, now you know instantly that you can either skip or execute.

> In the description above, the status of the previous instruction is
> the one used for the PR condition. You could also consider letting it
> be the status of the most recent conditional instruction (i.e.
> condition != AL). This is strictly more powerful (as the PR condition
> will always be true after an unconditional instruction). This would
> allow interleaving instructions conditional on PR with unconditional
> instructions, which might be good for scheduling.

Indeed, one of the things I tried out was instr skipping with only one
instruction decoder (thus skipping in the instr buffer). This causes
a bubble in the pipeline if the previous instr sets the flags.
So in this case you want to have a non-conditional instruction after
it, giving the instr buffer an extra cycle to skip. Since the ARM
executes many conditional instructions, even a simple implementation
which can skip max 1 instr per cycle will give a very significant
speed & power improvement.

> It could also be interesting to have the negation of the PR condition
> (NP), but there is not room for both. With NP you could interleave the
> two branches of a conditional

> Something similar to this is done on PA-RISC using conditional skip of
> next instruction. If I should choose either PR or NP, I think I would
> prefer PR, though.

Agreed.

A remark on the possible complications. For example, do we allow this:

ADDS x, x, #1
RSBMI x, x, #0
loop
ADDPR y, y, x
CMP x, y
BEQ loop

Now the ADDPR means ADDMI on the first iteration but ADDEQ on the
following iterations. Certainly potential for hackers!

Cheers,

Wilco

Torben AEgidius Mogensen

Dec 20, 1996, 9:00:00 AM

Wilco Dijkstra <Wilco.D...@armltd.co.uk> writes:

>Torben AEgidius Mogensen wrote:

>> But the PR code could also be implemented directly in the pipeline
>> instead of using a flag. It isn't clear anyway that it would be an
>> advantage to be able to set/clear the P flag using MSR and MRS.

>Well, at least I'd like to have my PR instructions execute equally
>well if there was an interrupt between two conditional instructions!
>The pipeline state isn't saved on interrupts...
>So you need the P bit in the PSR registers.

True. I hadn't thought about this.

>A remark on the possible complications. For example, do we allow this:

> ADDS x, x, #1
> RSBMI x, x, #0
>loop
> ADDPR y, y, x
> CMP x, y
> BEQ loop

>Now the ADDPR means either ADDMI (first iteration) but ADDEQ for the
>next iterations. Certainly potential for hackers!

It was my intention to let it be the dynamic instruction flow that
determines the status of the P flag. Hence, the above would be legal.
In this example the ADDPR instruction would be certain to be executed
in all but the first iteration, where it is dependent on the RSBMI
instruction. I don't think this behaviour will complicate the
implementation. Getting a static dependency would, however.
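
The dynamic behaviour can be traced with a small Python model of the quoted loop. It is a sketch under stated assumptions: P follows every instruction, Python integers stand in for registers (no 32-bit wraparound), and `max_iterations` is only a guard for inputs that would loop forever.

```python
def run_loop(x, y, max_iterations=10):
    """Model the ADDS / RSBMI / loop: ADDPR / CMP / BEQ sequence.

    Returns (x, y, log), where log records whether ADDPR executed
    on each pass through the loop.
    """
    # ADDS x, x, #1  (unconditional, sets N)
    x += 1
    n = x < 0
    # RSBMI x, x, #0  (P takes the outcome of the MI condition)
    p = n
    if p:
        x = -x
    log = []
    for _ in range(max_iterations):
        # ADDPR y, y, x  (executes only if the previous instruction did)
        if p:
            y += x
        log.append(p)
        # CMP x, y  (unconditional, sets Z)
        z = x == y
        # BEQ loop  (P takes the outcome of the EQ condition)
        p = z
        if not p:
            break
    return x, y, log

print(run_loop(-2, 0))  # RSBMI executes, so the first ADDPR does too
print(run_loop(5, 6))   # RSBMI is skipped, so the first ADDPR is skipped
```

In both runs every ADDPR after the first pass executes, since the loop is only re-entered through a taken BEQ.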

And if there is potential for hackers, there is also potential for
optimizing compilers. ;-)

Torben Mogensen (tor...@diku.dk)

Darren Salt

Dec 21, 1996, 9:00:00 AM

In article <32BA6D...@armltd.co.uk> Wilco Dijkstra <Wilco.D...@armltd.co.uk> wrote:

[snip]

> So you need the P bit in the PSR registers.

Right. Let's see you implement this in 26-bit mode ;->

[snip]

> A remark on the possible complications. For example, do we allow this:

> ADDS x, x, #1
> RSBMI x, x, #0
> loop
> ADDPR y, y, x
> CMP x, y
> BEQ loop

> Now the ADDPR means either ADDMI (first iteration) but ADDEQ for the
> next iterations. Certainly potential for hackers!

I don't see a way to disallow it. And yes, it could leave code open to some
interesting interpretations...

Don't let your superiors know you're superior to them.
