Transputers - a look back at a great microprocessor

by John Catsoulis

Introduction



INMOS introduced the first transputer in 1985. The transputer
was an innovative device. For the first time, a processor had been
combined with a communications subsystem for the specific purpose
of constructing scalable, parallel machines. Transputers
communicated with each other over four high-speed serial links.
The transputer also implemented a hardware process-scheduler,
allowing easy implementation of multi-processing. Though
transputers were originally designed for building supercomputers,
their most common application was in embedded systems. Some of the
most impressive arcade games (featuring high-resolution 3D
graphics and stereo sound) had transputer arrays at their heart.
INMOS produced a range of transputers. A summary of the processor
family is given in the table below.











































































































Property                     T222   T225   M212*  T414   T425   T800   T9000

Architecture (bits)          16     16     16     32     32     32     64
Internal Cycle Time (ns)     50     50     50     50     33     33     20
Performance (MIPS)           20     20     20     20     30     30     100 - 200
Floating Point Unit          No     No     No     No     No     Yes    Yes
On-chip RAM (bytes)          2 K    2 K    2 K    2 K    4 K    4 K    16 K
Internal Bandwidth (M/sec)   80     80     80     80     120    120    200
Address Space (bytes)        64 K   64 K   64 K   4 G    4 G    4 G    4 G
Interrupt Response (ns)      950    950    950    950    630    630    unstated
Link Speed (Mbits/sec)       20     20     20     20     20     20     100




* The M212 was a special 16-bit transputer designed to
provide an interface to disk subsystems.



 



Transputer Architecture



The basic transputer (the original T414, for example) consisted
of a conventional, sequential, RISC processor, a communication
subsystem (implemented as four high-speed, inter-processor links),
2 K of on-chip RAM and an on-chip memory interface.





Functional Diagram



The processor had only three general-purpose registers, A, B
and C. These registers are treated as a stack by the transputer
instruction set. Loading a value into A pushes the previous
contents of A into B, and the previous contents of B into C.
Similarly, storing a value from A pops B into A and C into B. The
instruction set works with this register stack implicitly.
Performing an add instruction adds the top two operands on the
stack. In other words, an add instruction always adds B to A,
leaving the result in A. There is no mechanism provided to ensure
that a stack overflow does not occur. This is left to the compiler
or assembly language programmer.
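
The behaviour of the register stack can be sketched in a few lines
of Python. This is only an illustrative model, not INMOS code: the
class and method names are invented here, and the handling of C
after a pop (left unchanged) and after an add (moved up into B) is
an assumption made for the sketch.

```python
class RegisterStack:
    """Illustrative model of the three-register stack (A, B, C)."""

    def __init__(self):
        self.a = self.b = self.c = 0

    def load(self, value):
        # Push: B moves into C, A moves into B, the new value goes into A.
        # Whatever was in C is silently lost; nothing checks for overflow.
        self.c, self.b, self.a = self.b, self.a, value

    def store(self):
        # Pop: return A, then move B into A and C into B.
        # (Assumption: C itself is left unchanged.)
        value = self.a
        self.a, self.b = self.b, self.c
        return value

    def add(self):
        # Add B to A, leaving the result in A.
        # (Assumption: C moves up into B, as for a pop.)
        self.a, self.b = self.a + self.b, self.c


# Evaluate 2 + 3 using the stack.
regs = RegisterStack()
regs.load(2)
regs.load(3)
regs.add()
result = regs.store()

# Loading a fourth value pushes the first one out of C.
overflow = RegisterStack()
for v in (1, 2, 3, 4):
    overflow.load(v)
```

After the four loads, the registers hold 4, 3 and 2; the value 1 has
gone without any warning, which is why the compiler or programmer
must keep track of stack depth.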



A similar programming model exists for the floating point unit
found in the T8-series of transputers. In addition to the three
general-purpose registers, the transputer also has a workspace
pointer, an instruction pointer and an operand register. The
workspace pointer points to a region of memory where the
parameters of the currently executing task are stored. The
instruction pointer references the next instruction to be fetched
and executed by the processor. The operand register is used in the
formation of instruction operands (the transputer's instruction
set is somewhat unusual in this respect). The programmer's model
is shown below.





Transputer Programmer's Model



It can be seen from the above that the transputer programming
model is a simple one. The transputer exploited the high-speed,
internal RAM to overcome the limitations of a small register set.
The small register set and RISC instructions meant that the
transputer had simple and fast data-paths and control logic.



The instruction set of the transputer is an unusual one. A
design decision was taken early in the transputer's development
that the transputer should be programmed in a high-level language.
One look at the instruction set for these machines is sufficient
to convince even the most die-hard of assembly-language
programmers to follow the INMOS suggestion. The instruction set
contains a relatively small number of instructions. Each
instruction is one byte long. The upper four bits of the opcode
specify the function to be performed, the lower four bits contain
the data. Two of the function codes allow the operand of an
instruction to be extended to any length up to the size of the
operand register.
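
The way operands are built up a nibble at a time can be sketched as
follows. This is a hedged illustration rather than a reference
implementation: the function-code values (0x2 for prefix, 0x6 for
negative prefix, 0x4 for load constant) are stated here as
assumptions and should be checked against the INMOS instruction set
documentation, and the model ignores the finite width of the
operand register.

```python
PFIX = 0x2  # prefix: shift the operand register up four bits (assumed code)
NFIX = 0x6  # negative prefix: complement, then shift up (assumed code)
LDC = 0x4   # load constant, used here as an example function (assumed code)


def encode(function, operand):
    """Return the byte sequence for one instruction with any operand."""
    if 0 <= operand < 16:
        return bytes([(function << 4) | operand])
    low = bytes([(function << 4) | (operand & 0xF)])
    if operand >= 16:
        return encode(PFIX, operand >> 4) + low
    return encode(NFIX, (~operand) >> 4) + low


def decode(code):
    """Yield (function, operand) pairs, modelling the operand register."""
    oreg = 0
    for byte in code:
        function, data = byte >> 4, byte & 0xF
        oreg |= data                  # the four data bits go into the low nibble
        if function == PFIX:
            oreg <<= 4
        elif function == NFIX:
            oreg = (~oreg) << 4
        else:
            yield function, oreg      # a real instruction consumes the operand
            oreg = 0


example = encode(LDC, 0x123)          # two prefixes followed by the load
round_trip = [list(decode(encode(LDC, v))) for v in (7, 16, 0x123, -1, -257)]
```

Small operands therefore cost a single byte, and each additional
four bits of operand costs one more prefix byte.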



The transputer supported two priority levels for executing
tasks. High-priority tasks took precedence over low-priority
ones. A low-priority task was only executed if there was no
high-priority task in the schedule. At any one time, a process
could be either active (executing or awaiting execution) or
inactive (waiting for I/O, waiting until a specified time or
waiting for a semaphore). The architecture of the transputer was
such that inactive processes did not use any processor time. The
transputer had two registers which pointed to a linked list of
workspaces. This list of workspaces constituted the process table.
The task switching time was very small and hence the transputer
implementation of multi-tasking was a very efficient one. The
transputer instruction set directly supported process creation and
termination.





Programmer's Model of the Transputer Process
Table



 



Transputer Clocks and External Buses



Most fast processors require an external clock at their
operational frequency. This can cause the system designer some
problems as high-frequency clocks can be affected by noise, skew
and termination problems. They can also be a source of noise for
the rest of the computer system. The engineers at INMOS took a
slightly different approach in addressing this problem. Rather
than having an external, high-speed oscillator, all transputers
ran off an external 5 MHz clock. Each processor had an internal
clock source which was phase-locked to the external clock. In this
way, the system designer did not have to worry about the problems
normally associated with high-frequency clocks.



A T225 transputer memory cycle is divided into four clock
states. Address and control setup occurs in state T1. Data is
set up in state T2. State T3 is when data is read or written. State
T4 is data and address hold time after an access. The signal
ProcClockOut is an output clock from the processor. An
access begins in clock state T1 with an address becoming valid on
MemA0-15. The address strobe, notMemCE, goes active,
indicating that a valid address is on the address bus.





T225 Read Cycle



For a write cycle, notMemWrB0-1 will go active low (both
for 16 bits, one only for 8 bits). For a read cycle,
notMemWrB0-1 will stay high. During a read cycle, data is
latched by the processor in state T4. The processor outputs data
during a write cycle in state T2 and holds this until state
T4.





T225 Write Cycle



Wait states may be generated by driving MemWait high
within 25 ns of ProcClockOut high in state T1. The
processor inserts wait states between clock states T2 and T3. Wait
states may be easily generated using flip flops. A circuit to
generate two wait states for a transputer is shown below.
notMemCE enables the flip flops at the beginning of a cycle
and (re-)SETs them at the end of the cycle. Initially
MemWait is high. After two clocks, MemWait goes low.
Adding additional wait states is just a matter of adding extra
flip flops.





Wait State Generator for a T225
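
The behaviour of such a generator can be simulated in software. The
sketch below is an assumption about how the circuit works, since
only its behaviour is described in the text: it models a chain of
flip-flops that is held set while notMemCE is high and shifts a 0
along the chain on each clock while notMemCE is low, with MemWait
taken from the last flip-flop. The function name and the
list-of-samples representation are invented for illustration.

```python
def mem_wait_trace(not_mem_ce, wait_states=2):
    """Return MemWait after each clock edge for a notMemCE waveform.

    not_mem_ce is a sequence of 0/1 samples, one per ProcClockOut edge.
    """
    flops = [1] * wait_states        # all flip-flops set, so MemWait is high
    trace = []
    for ce in not_mem_ce:
        if ce:
            # notMemCE high (no access): the flip-flops are held set.
            flops = [1] * wait_states
        else:
            # notMemCE low (access in progress): shift a 0 along the chain.
            flops = [0] + flops[:-1]
        trace.append(flops[-1])      # MemWait is the last flip-flop's output
    return trace


two_waits = mem_wait_trace([1, 0, 0, 0, 0, 1])
three_waits = mem_wait_trace([1, 0, 0, 0, 0, 1], wait_states=3)
```

In this model MemWait stays high for as many clocks as there are
flip-flops and then drops, which matches the statement that extra
wait states only need extra flip-flops.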



Interfacing a T225 is straightforward. The example below shows
a T225 interfaced to two 32k x 8 static RAMs. Note that since
transputers are able to work as processing elements in
multi-processor systems, it is common to have transputers
interfaced to memory alone. In such cases, the transputer is
booted over one of its communication links and uses I/O devices
interfaced to other transputers. (Note that not all pins are shown
in the example below.)





Interfacing a T225 Transputer to Static RAM with No Wait
States



BootFromROM is tied low. This causes the transputer to
wait to be booted over one of its communication links after reset.
If this processor were connected to a ROM, BootFromROM would
be tied high, causing the processor to begin executing code from
the ROM automatically after reset.



The communication links in the above example are used to
connect this transputer to other transputers in the system. The
Link0-3In lines should have 100 kΩ pull-down resistors to
ground. The Link0-3Out lines should have 56 Ω resistors in
series with the link.



 



Transputer Communication



Communication between processes is achieved through channels.
Channels are uni-directional, synchronised and unbuffered. A
channel may be between two processes running on a single
transputer, in which instance the channel is implemented using a
word in memory, or a channel may be between processes executing on
separate transputers, where the channel is implemented using the
high-speed, external links.
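
The semantics of such a channel (one direction, no buffering, and a
sender that cannot proceed until the receiver has taken the value)
can be illustrated with ordinary threads. This is a sketch of the
behaviour only, not of how the transputer implements it in
hardware; the Channel class and the demo function are hypothetical
names, and a single sender and single receiver are assumed.

```python
import threading


class Channel:
    """Uni-directional, unbuffered, synchronised channel (illustrative)."""

    def __init__(self):
        self._item = None
        self._ready = threading.Semaphore(0)   # sender has placed a value
        self._taken = threading.Semaphore(0)   # receiver has collected it
        self._send_lock = threading.Lock()     # one send at a time

    def send(self, value):
        with self._send_lock:
            self._item = value
            self._ready.release()
            self._taken.acquire()   # block until the receiver has the value

    def receive(self):
        self._ready.acquire()       # block until a sender arrives
        value = self._item
        self._taken.release()
        return value


def demo():
    """Pass five values from a producer thread to the calling thread."""
    chan = Channel()

    def producer():
        for i in range(5):
            chan.send(i)

    worker = threading.Thread(target=producer)
    worker.start()
    received = [chan.receive() for _ in range(5)]
    worker.join()
    return received
```

Because neither side can run ahead of the other, the channel needs
no queue: at most one value is ever in flight.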



The external links between transputers are implemented using
two uni-directional lines connecting each transputer pair. The
data is transferred serially at high speed.



A major problem with the conventional architecture of the
transputer is the distinction between inter-process communication
via an on-chip channel or via an external link. Software has to be
written and compiled specifically for the transputer system it is
to run on and is not easily portable to other transputer networks.
In addition, algorithms have to be specifically matched to the
architecture of the transputer system to ensure efficient use of
the machine. The mapping of processes onto processors had to be
specified at or before compilation time. The T9000 processor was
an attempt to address these
problems. It was intended as a standard architecture machine to be
used in message-passing MIMD computers. Inter-process
communication is achieved through virtual channels rather than
distinct hard or soft channels. This greatly simplifies the
programming of the transputer system and leads to greater software
portability.



 



T9000



The T9000 was the last processor in the transputer family. It
comprised a superscalar 32-bit processor, a 64-bit floating point
unit, a dedicated communications processor with four links, an
inbuilt external memory interface and 16 K of on-chip memory,
which could be used as cache, normal memory or a combination of
the two.



The IMS T9000 processor had a peak performance of 200 MIPS and
25 MFLOPS whilst operating as a single processor running at 50
MHz. Given the time of its design, and that it was running off a
5 MHz crystal, this was quite amazing.





T9000 Functional Diagram



The T9000 executes processes sequentially, but implements a
process scheduler in hardware. In transputers, multi-tasking is
moved from the software kernel to the hardware of the processor.
The T9000 has three registers to implement the process table. The
Front register points to the first process in the table and the
Back register points to the last process in the table. Each entry
in the process table contains a pointer to the next process. The
Workspace Pointer points to the process currently being executed.
A process may either be active or inactive. Active processes are
either being executed or are awaiting execution. Inactive
processes are descheduled and may be waiting for I/O, a semaphore
or until a specified time. The scheduler of the T9000 works in
such a way that inactive processes do not consume any processor
time. This is a very efficient implementation of multi-tasking and
is quite different to that used on machines with multi-tasking
implemented through a kernel.



The instruction set of the T9000 supports multi-tasking
directly. Two instructions, startprocess and endprocess directly
affect the process table. They allow, as their names suggest, the
creation and destruction of processes in the process table.



The process scheduler has two process queues, one for high-priority
processes and one for low-priority processes. A low-priority
process will only be executed by the processor if no high-priority
process is waiting. In addition, a high-priority process becoming
active causes the scheduler to suspend the current low-priority
process and begin execution of the high-priority process.
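
The queue structure described above can be sketched as a pair of
linked lists, each addressed by a front and a back pointer. The
class and method names below are invented for illustration, and the
sketch only models the choice of the next process to run; it does
not model the suspension of a low-priority process that is already
executing.

```python
class Workspace:
    """A process workspace; `next` links it to the following process."""

    def __init__(self, name):
        self.name = name
        self.next = None


class ProcessQueue:
    """Linked list of active processes with front and back pointers."""

    def __init__(self):
        self.front = None
        self.back = None

    def enqueue(self, ws):
        ws.next = None
        if self.front is None:
            self.front = ws
        else:
            self.back.next = ws
        self.back = ws

    def dequeue(self):
        ws = self.front
        if ws is not None:
            self.front = ws.next
            if self.front is None:
                self.back = None
        return ws


class Scheduler:
    """Two queues: high-priority processes always run before low."""

    def __init__(self):
        self.high = ProcessQueue()
        self.low = ProcessQueue()

    def start_process(self, ws, high_priority=False):
        (self.high if high_priority else self.low).enqueue(ws)

    def next_process(self):
        # A low-priority process is chosen only if no high-priority
        # process is waiting. Inactive processes are in neither queue,
        # so they are never examined and cost no processor time.
        return self.high.dequeue() or self.low.dequeue()


sched = Scheduler()
sched.start_process(Workspace("A"))
sched.start_process(Workspace("B"))
sched.start_process(Workspace("H"), high_priority=True)
order = [sched.next_process().name for _ in range(3)]
```

Here the high-priority process H is selected first even though it
was started last, followed by A and B in the order they were
queued.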



 



T9000 Communication



The message-passing system implemented in the T9000 is
point-to-point and synchronised. By employing this form of
communication, the T9000 does not require message queues or
buffers to be implemented directly by the processor. This is the
same communication mechanism employed in the first-generation
transputers. However, the T9000 extends the communication system
implemented in earlier transputers through the use of virtual
links rather than specific hard links. A virtual link may
represent a communication channel between a local process and a
process located elsewhere in the parallel machine. Several virtual
links may use a single hardware link for message routing. This
allows for an arbitrary number of virtual links to exist over a
limited number of physical links. Thus, the communication
limitation of the earlier transputers is overcome.



The virtual links are controlled by a dedicated, communication
processor located within the T9000. This processor is known as the
virtual channel processor (VCP). The VCP accepts packets
for communication along the virtual links and routes these either
directly to another T9000 or through a network of routers using
the hardware links of the T9000.
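
The idea of sharing one physical link between several virtual
links can be sketched as follows. This is a simplified model under
stated assumptions: the packet layout (a link identifier, a payload
and an end-of-message flag), the four-byte payload size and all the
function names are invented for illustration, and the per-packet
acknowledgement that a synchronised protocol needs is left out.

```python
from collections import defaultdict
from itertools import zip_longest

PACKET_SIZE = 4  # payload bytes per packet; an arbitrary illustrative value


def packetise(link_id, message):
    """Split a message into (link_id, payload, is_last) packets."""
    chunks = [message[i:i + PACKET_SIZE]
              for i in range(0, len(message), PACKET_SIZE)] or [b""]
    return [(link_id, chunk, i == len(chunks) - 1)
            for i, chunk in enumerate(chunks)]


def multiplex(*streams):
    """Interleave packets from several virtual links onto one wire."""
    wire = []
    for group in zip_longest(*streams):
        wire.extend(p for p in group if p is not None)
    return wire


def demultiplex(wire):
    """Reassemble the message carried by each virtual link."""
    partial = defaultdict(bytes)
    messages = {}
    for link_id, payload, is_last in wire:
        partial[link_id] += payload
        if is_last:
            messages[link_id] = partial.pop(link_id)
    return messages


wire = multiplex(packetise(1, b"hello world"), packetise(2, b"T9000"))
messages = demultiplex(wire)
```

Because every packet carries the identity of its virtual link, the
receiving end can separate the interleaved traffic again without
either process knowing that the physical link is shared.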



 



T9000 Memory Interface



The clocks for the T9000 are generated internally. These clocks
are phase-locked to an external 5 MHz clock. This avoids problems
associated with routing high frequency clocks throughout the
computer system. The processor has three input pins which may be
used to specify operating speed. Maximum speed for the T9000 is
expected to be 50 MHz for the first version of the processor.



The T9000's bus interface provides direct support for dynamic
RAMs. The address lines of the T9000 behave as multiplexed address
lines when accessing a region of memory defined as dynamic RAM.
The timing of ~RAS and ~CAS may be configured under
software control to suit the memory devices used.





Interfacing a T9000 to 8 Mbytes of DRAM



The T9000 provides direct support for booting from an EPROM. It
has a dedicated chip enable (notMemBootCE) for an EPROM and
dynamically sizes its bus down to eight bits when accessing the
ROM. The T9000 also generates appropriate addresses for the ROM on
address lines MemAdd2 through MemAdd15. These
address lines correspond to A2 through A15 on the
ROM. Two programmable strobe lines (notMemWrB2 and
notMemWrB3) become address bits A0 and A1
when the processor is accessing the ROM address space.



An input to the T9000 (StartFromROM) determines whether
the transputer will boot off a local ROM, or will remain idle,
waiting to be booted by a root transputer over one of its
high-speed links. Normally only one transputer would have a boot
ROM and this transputer would then configure the other transputers
in the network through the communication links. However, it is
also possible for individual transputers to have their own ROM.
This ROM may contain local boot information or software to
configure and control I/O devices local to that transputer.



The basic configuration for interfacing a T9000 transputer to a
boot EPROM is shown below.





Interfacing a T9000 to a Boot ROM



C104 Crossbar



An essential element of the T9000 system architecture is the
ability to pass messages quickly between processing nodes of the
parallel machine. Interconnection is provided by dedicated VLSI
routers implementing a 32-way crossbar. The INMOS C104 is the
companion router to the T9000. Separating the processing unit
(T9000) from the router (C104) has a number of advantages
[May, et al. 1993] :




  • Systems not requiring message routing may interconnect
    T9000s directly.


  • Routers may have a relatively large number of links
    allowing large systems to be constructed from a small number of
    routers.


  • Transputers are not required to implement message passing
    between neighbours.


  • The architecture of the parallel machine is scalable.



Each C104 has 32 communication links capable of operating at
100 Mbits/sec. The router provides wormhole routing of incoming
data packets. As the header of each packet arrives on a given
link, the C104 determines the destination link for the packet and
generates the appropriate internal route before the main body of
the packet is sent. The C104 is capable of routing packets through
all links concurrently.





Wormhole Routing through Multiple C104s



The C104 is also capable of arbitrating between multiple data
packets requiring the same output link and causes them to be
output sequentially along the required link [May, et al. 1993].



The C104 provides for header deletion. This allows a packet to
contain multiple headers specifying a path through several
routers. Each router routes the packet and deletes the current
header, exposing the next header for routing by the next C104.
Thus, C104s may be combined together to form large, multi-layered
networks.
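
Header deletion can be illustrated with a small routing function.
The sketch below assumes, purely for illustration, that each header
is simply an output link number and that the network is described
by a dictionary; the router and processor names are hypothetical.

```python
def route(packet, routers, start):
    """Follow a packet through a network using header deletion.

    packet  -- (headers, payload), headers being a list of link numbers
    routers -- maps a router name to {output link number: next node}
    start   -- name of the first router the packet arrives at
    """
    headers, payload = packet
    node = start
    path = [node]
    while node in routers:
        # Use the leading header to pick the output link, then delete
        # it so that the next header is exposed to the next router.
        link, headers = headers[0], headers[1:]
        node = routers[node][link]
        path.append(node)
    return path, headers, payload


# A hypothetical two-layer network: router a feeds router b.
network = {
    "C104-a": {3: "C104-b", 7: "T9000-x"},
    "C104-b": {12: "T9000-y"},
}
path, remaining, payload = route(([3, 12], b"data"), network, "C104-a")
```

The packet leaves the first router on link 3, leaves the second on
link 12, and arrives at its destination with no headers left.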



 



MIMD using T9000s



Barnaby, May and Nicole in Networks, Routers and Transputers
[May, et al. 1993] propose a
fully-interconnected, folded-Clos architecture for a
multiple-T9000 parallel machine. Groups of four T9000s are
interconnected using C104 crossbars. These crossbars are in turn
connected to a second layer of crossbars, providing not only full
interconnectivity for the system, but multiple, concurrent paths
between processors.





Folded-Clos MIMD Machine using T9000s and
C104s



In this system, each T9000 has its own local memory for program
and data storage. Some T9000s may have I/O facilities and act as
I/O servers for the rest of the system. Only one T9000 requires
ROM and is capable of first booting the network, then booting the
other processors over the network.



A separate control network interconnects the other T9000s and
crossbars and allows them to be configured by the boot transputer.
Note that the control network is separate to the data network
which carries interprocess communication. Each T9000 and C104 has
dedicated control links (Clink0 and Clink1) in
addition to the four data links. Clink0 is the control
input link and Clink1 is the output control link. In small
systems using only T9000s and no routers, the control links are
daisy-chained together. The Clink1 of one transputer is fed
to the Clink0 input of the next transputer.





Configuring a Small, Routerless, MIMD
System



However, it is also possible to use a C104 to provide routing
of the control network. This is preferable in larger networks
where latency in the communication channel can cause unacceptable
delays if daisy-chaining is used. The C104 is configured through
its control link; however, the data links of this C104 are
connected to the control links of the T9000s in the MIMD machine.
This is achieved by connecting the C104's output control link,
Clink1, to one of its data input links.



Thus control packets may be routed by the C104 to other devices
(T9000s or data-network C104s) in the system as though they were
data packets. The control links, although functionally different
to the data links, use the same protocol for communication. The
C104 used to implement the control network is not used as part
of the data network.





Configuring an MIMD System using a C104 Crossbar
 



 



Booting



There are six levels of reset for a T9000-based, MIMD machine.
When the machine powers up and undergoes a hard reset, all T9000s
in the system are in their reset state. This is reset level 0.
Each transputer checks its StartFromROM input. All
transputers, save one, will have this tied low and will therefore
wait to be booted over their links. The transputer with
StartFromROM high will configure its bus interface
appropriately and begin executing code from its boot ROM. This
processor is the root processor for the network and the control
processor for the machine. The software executed by this
transputer identifies and 'labels' all devices connected to the
control network. This includes configuring the control-network
C104. The system is now at reset level 1.



The boot transputer, under software control, uses the control
network to configure all devices. The system is now at reset
level 2.



The virtual links are configured, boot code is downloaded to
each transputer and executed. All transputers are now operational
and the system is at reset level 3.



The processes belonging to the application software are
downloaded and distributed throughout the system over the data
network. Each transputer in the system configures its virtual
links as required by the application. The system is now at
reset level 4.



The application is then executed by the machine. The system is
now at reset level 5 and is operational.
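
The six reset levels form a simple linear sequence, which can be
written down as an enumeration. The member names below are labels
invented for this sketch; only the level numbers and what happens
at each level come from the description above.

```python
from enum import IntEnum


class ResetLevel(IntEnum):
    HARD_RESET = 0          # every device is in its reset state
    LABELLED = 1            # control-network devices identified and labelled
    CONFIGURED = 2          # all devices configured over the control network
    BOOT_CODE_RUNNING = 3   # virtual links set up, boot code downloaded and run
    APPLICATION_LOADED = 4  # application processes distributed to transputers
    RUNNING = 5             # application executing; system operational


def next_level(level):
    """Return the level reached by the next boot step, or None if running."""
    if level is ResetLevel.RUNNING:
        return None
    return ResetLevel(level + 1)


boot_order = [level.name for level in ResetLevel]
```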



Normally, the system is only taken through the above sequence
by a hard reset. However, the control process (operating system)
may reboot the machine by sending a Reboot message to some or all
devices in the system.



 



Conclusion



The transputer was a great architecture, in many ways ahead of
its time. Its novel stack-based architecture, inbuilt
multi-tasking support and inter-processor and inter-process
communication were revolutionary. It was a beautiful machine.
Unfortunately, it was killed not long after its company was taken
over by another, and no further development was done on the
architecture.




3 Comments

jbgreer
2004-02-28 09:10:11
Indeed


I still have copies of both the Communicating Process Architecture and Transputer Instruction Set: A compiler writer's guide, both published by Prentice Hall. In 1988 I was an attendee and poster-presenter at the 3rd Conference on Hypercube Concurrent Computers and Applications. Several of the INMOS luminaries were there to discuss the Transputer, which was being used in a number of standalone parallel computers and parallel processing cards - units which were plugged into an ordinary PC. They were very impressive; I recall standing slack-jawed as one such card-based PC unit drew a Mandelbrot set on screen in real-time; the 16 node Intel Scientific Hypercube back at university was glacial by comparison.



I am amazed that you managed to only briefly touch on 'the high level language' used to program Transputers. That language, of course, is Occam. Its involvement with the hardware is closer than you describe. The floating point unit for the T800 unit, for instance, was first written in Occam, verified via formal analysis, and then translated into hardware. I recall discussing this with Welch, who was very passionate about one's ability to prove Occam code. Sadly, Occam seemed to be a bit of a stumbling block to Transputer adoption; I remember INMOS advertising a C compiler for the units as well. Occam, though, seemed much superior, since the language had parallel, sequential, and serial constructs by which to organize the flow of code. That, and its primitives that mapped to the communication links.



I look back on those times with great fondness - parallel computing was very exciting (and primitive, too). Inmos, at the time, seemed to have a brilliant solution, especially when compared to Intel-based solutions. The delays in getting the T9000 unit to market, though, did allow their competitors to catch up a bit. By that point, even an outsider such as I could sense that the company had faltered; I just had no idea how badly.

johncatsoulis
2004-02-29 16:04:44
Indeed
You are right about Occam. It was a beautiful language, although it required a different mindset to C. I think a lot of people just couldn't make the mental shift to it. Interestingly, the language was created first, and then the hardware (transputer) was developed to run it. Usually, it is the other way around.


There was a project a few years back to port Occam to other platforms, but I don't know what happened to it. It was called "Southampton Portable Occam Compiler" or "SPOC" for short. There are lots of links under Google, but most seem to be old.


I too have lots of old Transputer documentation. I've got original databooks for the pre-release T9000, as well as INMOS's Networks, Routers and Transputers book. It's a great reference for parallel computing, even after 10 years. I've even got a couple of T2xx transputer chips sitting in the cupboard, awaiting the spare time and motivation to turn them into a machine.


jtc

rmeenaks
2004-03-04 09:19:59
Indeed
Hi,


I maintain the transputer homepage at http://www.classiccmp.org/transputer to keep the memory alive. There is a much more recent Occam compiler called KrOC which generates machine code for the target platform. It is based on the original Occam-2 compiler from INMOS with a lot of enhancements for Occam-3 and other constructs. Definitely worth a look into. SPOC was an Occam-2 to ANSI-C compiler.


I am taking donations for documentation, hardware, and software if anyone can contribute...



Cheers,


Ram