
[oc] Re: Intel Hyper-threading and CPU design, threads etc...



On Sat, Dec 08, 2001 at 05:57:10AM -0000, Paul McFeeters wrote:
> Jim,
> 
> Surely people have already done this "two independent processors on one
> chip"? I've already expressed my wish to load 4 copies of a processor core
> into one FPGA, just waiting for Damjan to tell me that the OpenRISC
> project is small enough to fit two cores on a 200K Spartan II. Of course the
> problem still comes down to concurrent memory access but even that can be
> almost eliminated very cheaply with a little thought.

Provided you take care that separate threads do not want to access the
same memory locations at the same time.  Plus you need either caches,
separate memory banks that do not block each other on access, or
multi-port RAM.

>                                                        I ran a test a while
> back on a 4-way Xeon box with multithreading and then took 3 processors out
> and ran the same test. The 4 processors managed to complete the tasks 2.4
> times quicker than the single processor so you lose approx. 40% of clock
> cycles due to memory conflicts?

To be expected, if the software wasn't carefully written with that in
mind.  If you can get the overhead down to 20% that would already be
very good, I think.

>                                 When I (evil me) rewrote the test software
> to be cruel and have every processor hitting the same memory block
> alternately reading and writing to it, the 4 processors managed 1.8 times
> quicker than the single processor, loss of 55% clock cycles there.

I'm surprised that it isn't worse.  Of course there are more efficient
ways to de-optimize your programs.
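
Just to make the described pattern concrete (a sketch of my own, not the
actual test code; thread count, block size and iteration count are made
up), the cruel variant amounts to something like this, where the cache
lines holding the shared block have to bounce between the CPUs on nearly
every access:

/* contention.c -- illustrative sketch only (not the original test):
 * every thread alternately reads and writes the same small block, so
 * the cache lines holding it are constantly transferred between CPUs.
 * Build: cc -O2 -pthread contention.c -o contention */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS   4
#define BLOCKWORDS 16            /* one block shared by all threads */
#define ITERATIONS 10000000L

static volatile long block[BLOCKWORDS];

static void *hammer(void *arg)
{
    long i;
    (void)arg;
    for (i = 0; i < ITERATIONS; i++) {
        long v = block[i % BLOCKWORDS];     /* read ...           */
        block[i % BLOCKWORDS] = v + 1;      /* ... then write back */
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    int i;

    for (i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, hammer, NULL);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    puts("done (the result is meaningless, the contention is the point)");
    return 0;
}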

To gain anything at all from multiprocessor machines, careful
consideration has to be put into dividing the problem into parts so that
most of it can run from cache (optimize for memory footprint; CPU power
is plentiful) and so that the memory touched by each process overlaps as
little as possible (to avoid cache line transfers between CPUs).
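
To make that concrete as well (again only a sketch of my own; the thread
count, array size and the 64 byte line size are assumptions), give each
thread its own contiguous slice of the data and its own padded result
slot, so that no cache line is written by more than one processor:

/* partitioned.c -- illustrative sketch: each thread sums its own
 * contiguous slice of the array and stores its partial result in a
 * padded slot, so the per-thread results are not packed into one
 * cache line.  Build: cc -O2 -pthread partitioned.c -o partitioned */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS  4
#define NELEMENTS (1L << 20)
#define LINESIZE  64                      /* assumed cache line size */

static long data[NELEMENTS];

struct slot {                             /* one per thread, padded so */
    long sum;                             /* the sums end up in        */
    char pad[LINESIZE - sizeof(long)];    /* different cache lines     */
} partial[NTHREADS];

static void *worker(void *arg)
{
    long id = (long)arg;
    long lo = id * (NELEMENTS / NTHREADS);
    long hi = lo + (NELEMENTS / NTHREADS);
    long sum = 0, i;

    for (i = lo; i < hi; i++)             /* stays within this thread's */
        sum += data[i];                   /* own slice of the data      */
    partial[id].sum = sum;
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    long i, total = 0;

    for (i = 0; i < NELEMENTS; i++)
        data[i] = i & 0xff;
    for (i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (i = 0; i < NTHREADS; i++) {
        pthread_join(t[i], NULL);
        total += partial[i].sum;
    }
    printf("total = %ld\n", total);
    return 0;
}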

>                                                                    The 4 CPU
> box with the extra processors was over 10 times the price of a single CPU
> box, great value for money there! lol

Lousy programmers write lousy programs.  And if your problem isn't a
good fit for SMP, the extra processors are a waste of money.  No news
here.

> Old Way:
> 
> 	CMP AX,$2000
> 	JC  #4000
> 
> Two instructions to do that? Why use two clock cycles for it? Can't we just
> have a CMPJxx instruction? It would take two value/register combinations and,
> based on the result of the comparison, jump to the destination or not.

There are two things to be done: a subtraction and a conditional jump.
Just because you put them in the same instruction doesn't make them
execute in one cycle.  If they do execute in one cycle, I guess the
cycles on that machine are generally longer.
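
For illustration (my own sketch; reading $2000 as hex, and the function
name is invented since the original just jumps to address #4000), the
quoted pair expresses roughly this:

#include <stdint.h>

/* Roughly what "CMP AX,$2000" followed by "JC #4000" expresses: an
 * unsigned compare against a constant plus a conditional branch.
 * On x86 a compiler emits cmp + jb, with the carry flag carrying the
 * result in between.  On a flag-less ISA such as MIPS the same test
 * becomes sltiu + bne through a register, which is still two
 * operations even though there are no status bits. */
int below_threshold(uint16_t ax)
{
    if (ax < 0x2000)        /* the compare ...                        */
        return 1;           /* ... and the branch (stands in for the
                             * jump to #4000 in the original)         */
    return 0;
}

Whether or not the two halves are fused into one instruction word, the
ALU still has to produce the comparison result before the branch can
resolve.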

> This might also eliminate quite a few status bits, which might make ILP
> much easier?

So you'd always need an extra compare instead of just using the status
bits set by previous arithmetic.  Overall a slower architecture, I
assume.

>              The first CPU core I'll design will only use intelligent 64-bit
> instructions rather than two/three 32-bit instructions. Memory is cheap, so
> whilst my programs could be 2 or even 4 times bigger than other people's they
> will execute faster, which is the goal after all.

Memory may be cheap in the money sense but it is very expensive in a
performance sense.  Compare the instruction cycle length to the time
required for a cache line fill from SDRAM to get at the instructions in
the first place.  Something that needs twice the cycles will probably be
faster than something that needs twice the memory.  In the latter case
the CPU will execute fewer instruction cycles but will instead perform
many more memory access cycles.
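
To put rough numbers on that (purely illustrative, era-typical
magnitudes): with a core clocked around 1 GHz an instruction cycle is
about a nanosecond, while a cache line fill from SDRAM easily takes on
the order of 100 ns, i.e. roughly a hundred instruction slots per miss.
A program that is twice as big misses the instruction cache
correspondingly more often, and a handful of extra line fills already
costs more time than the halved instruction count wins back.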

-- 
Andreas Bombe <bombe@informatik.tu-muenchen.de>    DSA key 0x04880A44
--
To unsubscribe from cores mailing list please visit http://www.opencores.org/mailinglists.shtml