
Re: [oc] Re: Merlin Hybrid System




----- Original Message -----
From: "Andreas Bombe" <bombe@informatik.tu-muenchen.de>
To: <cores@opencores.org>
Sent: Saturday, December 08, 2001 9:32 PM
Subject: [oc] Re: Merlin Hybrid System
> And you can automate finding the places where to insert spinlocks so
> that all non-atomic operations are safe but still not too much is
> locked?
>
On my WinME system, as I write this, there are about 150 threads
running. You can determine externally that many of these threads
do not require spinlocks. For example, portions of each of these
threads (strands derived from the threads) do not share the same
address space (code and/or data). Therefore there is no requirement
to insert a spinlock. For portions of code and/or data that overlap,
there are ways of using features extended into the memory
management system. Examples of this are described on
Transmeta's website, although they use it for invalidating
morphed code when self-modifying code runs or when, as an example,
I/O loads in new code. My intentions are different from this but
no more difficult to implement.
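
To illustrate the address-space test described above, here is a minimal
sketch in Python. It is not the actual mechanism; the names Region,
overlaps and needs_lock are hypothetical, and it assumes each strand's
code/data footprint is known as a list of half-open address ranges.

```python
from typing import NamedTuple, Sequence


class Region(NamedTuple):
    start: int  # first address in the region (inclusive)
    end: int    # one past the last address (exclusive)


def overlaps(a: Region, b: Region) -> bool:
    """True if two half-open address ranges share at least one address."""
    return a.start < b.end and b.start < a.end


def needs_lock(strand_a: Sequence[Region], strand_b: Sequence[Region]) -> bool:
    """Assumption: a lock is only needed if any region of one strand
    overlaps any region of the other strand."""
    return any(overlaps(x, y) for x in strand_a for y in strand_b)


# Hypothetical footprints for three strands.
strand_1 = [Region(0x1000, 0x2000)]
strand_2 = [Region(0x3000, 0x4000)]   # disjoint from strand_1
strand_3 = [Region(0x1800, 0x1900)]   # lies inside strand_1's range

no_lock_needed = not needs_lock(strand_1, strand_2)
lock_needed = needs_lock(strand_1, strand_3)
```

Strands 1 and 2 touch disjoint ranges, so no lock would be inserted;
strands 1 and 3 overlap and would need one (or the memory-management
approach mentioned above).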

> How small are those fragments, anyway?  You do want to separate at least
> at the level large loops, don't you?
>

They can be very small since context switch overhead might be as
low as 1 memory cycle (less if the context is internal to the chip).
They don't have to be loops. Straight-line code can be fragmented too.

The optimizations will be most beneficial for a single Intel-like
processor running many independent threads, e.g. virtually every
Windows-based computer. The optimizations are least beneficial
when a single thread (single code stream) dominates the processor
time. Most processor designers view optimizations from the single
code stream point of view and not from the overall system point
of view. My process approaches the problem from the overall
system point of view.

Traditional - how to speed up

    for i = 0 to m
        for j = 0 to n
            func(i,j)
        next j
    next i

On a Windows-based system you can have hundreds of these running.
My method addresses how you get all of these code streams to
finish sooner. There is more bang for the buck if you can get more of
these running at the same time instead of getting each to run slightly
faster.
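
The arithmetic behind that claim can be sketched as follows. This is an
idealized model with made-up numbers, not a measurement: it assumes the
code streams are fully independent and of equal length, and it ignores
locking, memory bandwidth and context-switch cost. The function names
are hypothetical.

```python
import math


def finish_time_faster_core(streams: int, work: float, speedup: float) -> float:
    """All streams run one after another on one core that is `speedup` times faster."""
    return streams * work / speedup


def finish_time_parallel(streams: int, work: float, elements: int) -> float:
    """Streams are spread over `elements` processing elements at the base speed."""
    return math.ceil(streams / elements) * work


# Hypothetical figures: 100 code streams, 1 time unit of work each.
faster = finish_time_faster_core(streams=100, work=1.0, speedup=1.25)
parallel = finish_time_parallel(streams=100, work=1.0, elements=4)
print(faster, parallel)  # 80.0 25.0
```

Under these assumptions a 25% faster single core finishes all streams in
80 time units, while four base-speed elements finish in 25.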


> > A multithreaded Win32 application on a single processor system
> > can have portions of multiple threads running concurrently while
> > the O/S (Windows) is under the assumption that only one thread
> > is active. Let me re-word that. The replacement chip built, would
> > look to the motherboard and to the O/S (and applications) as
> > if it were a single processor.
>
> You're claiming magic SMP without code support.  That I've understood.
>

From the outside viewpoint (e.g. as detected by the operating system)
there is a single processor capable of running only one thread at a time.
Internally there are multiple processors that are capable of hiding from
an outside observer what the processor has actually done.

Without disclosing what I am doing, you can see this effect to some
extent on an Intel processor. The processor sucks in some number of
instructions for decoding. From the outside viewpoint, is the instruction
pointer just after the last byte read? It is not there yet. Now assume
an interrupt occurs while this wad of instructions is inside the pipeline.
You could flush the pipeline and roll the instruction pointer back.
You could defer honoring the interrupt until there is room in the pipeline.
You could start sucking in interrupt code while you complete execution
of the code in the pipeline. All of this could occur without regard to what
the operating system may think has happened.

The operating system doesn't care. It is only an abstraction for the
underlying process. In this abstraction you can separate what it
thinks has happened from what has actually happened. And this is
what I can exploit.

> I still won't believe until I can see it.

I can only show you under circumstances that won't inhibit patent
requirements. I am just a "little guy" with an out of the box idea.
If you are in a position to protect my interests and to advance
the idea from concept to reality then bring over your chalk board.

This concept is relatively easy to understand for a technically competent
person. It is one of those epiphany moments. A flash bulb will go off
when you see what it is. Implementing the concept is a different story.
That will take money, which I don't have.


> >                                Inside this processor the strandification
> > process runs and distributes the processing to multiple processing
> > elements. This occurs now to a much lesser extent with processor
> > pipelines wherein multiple paths of a branch can begin to execute
> > as well as where FPU operations are concurrent with integer
> > operations (and multi-media instructions, ...).
>
> Or where integer operations can operate concurrently with other integer
> operations.  There is more than one integer unit in every Pentium or
> Athlon, after all (and more than one FPU unit, except for Pentium 4).
>

The strands expand to encompass more code than just a few instructions.


> > > How are you going to automate that (finding the line/frame loop
> > > and taking it apart)?  Your converter would have to _understand_ the
> > > program, i.e. it would have to be an AI.
> > >
> >
> > This is a well defined process already. There are compilers that can
> > take say a FORTRAN program and parallelize it. So the techniques
> > have already been proven.
>
> A FORTRAN source program also gives a lot of information.  It also is
> somewhat stricter, so that it's easier to find points to parallelize.  I
> can not speculate about the quality of the output (or the input
> requirements).  For machine code that has no high level harness it's
> going to be a lot harder.
>

As indicated earlier, a system (e.g. Windows) has many programs running.
You only break up a single thread at determinable breaking points. You
gain the remainder of the effectiveness by running independent strands from
different threads.
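
A rough way to picture this is a toy scheduler. The sketch below is
only an illustration under stated assumptions, not the undisclosed
method: strands within one thread are assumed to run strictly in order
(the most conservative case), strands from different threads are
assumed fully independent, and a simple greedy rule hands the earliest
ready strand to the earliest free processing element. The name
makespan and all durations are hypothetical.

```python
import heapq


def makespan(threads, elements):
    """threads: list of lists of strand durations (run in order per thread).
    Returns the time at which the last strand finishes."""
    # Min-heap of times at which each processing element becomes idle.
    free_at = [0.0] * elements
    heapq.heapify(free_at)
    # Min-heap of (time the strand may start, thread index, strand index).
    ready = [(0.0, t, 0) for t, strands in enumerate(threads) if strands]
    heapq.heapify(ready)
    finish = 0.0
    while ready:
        ready_time, t, s = heapq.heappop(ready)
        element_free = heapq.heappop(free_at)
        start = max(ready_time, element_free)
        end = start + threads[t][s]
        heapq.heappush(free_at, end)
        finish = max(finish, end)
        if s + 1 < len(threads[t]):
            # The thread's next strand may not start before this one ends.
            heapq.heappush(ready, (end, t, s + 1))
    return finish


many_threads = [[1, 1], [1, 1]]   # two threads, two strands each
one_thread = [[1, 1, 1, 1]]       # one thread dominating the processor

t_many_1 = makespan(many_threads, 1)
t_many_2 = makespan(many_threads, 2)
t_one_4 = makespan(one_thread, 4)
```

In this toy model two threads on two elements finish in half the time
of one element, while a single in-order thread gains nothing from four
elements, which matches the most/least beneficial cases described
earlier.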

> > > I doubt the transparency part.  As for the effectiveness, have you
> > > benchmarked it?
> > >
> >
> > Transparency doesn't necessarily mean undetectable. You can write
> > code that can detect which version of the Pentium or Athlon your
> > code is running on. But for the most part applications run
> > transparently.
> >
> > As for benchmarking. How can I benchmark without building it?
>
> How can you tell it's effective without benchmarking it?
>

For a particular circumstance you could, using hypotheticals, compute
the effectiveness. In practice, what you experience is much different.

How would you suggest I benchmark a mathematical model of my WinME
system running AutoCad, WinAmp, DVD player, email, SETI and ten other
apps at the same time? This is way beyond the scope of a small fish like me.
Intel, AMD, Transmeta et al. have the wherewithal to do this. I would
like to do this without getting picked to the bone.

Jim Dempsey
President
TAPEDISK Corporation

