This page contains the student task proposals (small and master thesis)
which are currently available within the MOVE project. Detailed information
is available from dr. Henk
Corporaal.
If you have any interest in one or more of these research areas
and their application within the MOVE project, you are invited to perform
your thesis research within our group.
A list of possible thesis assignments follows below. Note that this
list is not exhaustive and may change over time; however, it gives you
an indication of the type of research and development we are interested
in. Usually a final assignment description in one of these areas is
drawn up after investigating your interests and possibilities.
Exploiting complex operations
It is possible to implement patterns of RISC-like operations as a single
new, more complex operation. A simple example of this is the multiply-add:
a multiplication followed by an addition is implemented as a single operation.
The advantage of such fusions is faster execution and a reduction of traffic
across data buses. It is interesting to determine the potential speedup
and transport reduction that a particular application gains from using
complex function units. This means: 1) finding out which combinations of
operations may be useful, and 2) using these complex operations. The latter
can be done either through trace analysis (actually covering the trace of
executed operations with these complex ones) or during scheduling.
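The trace-covering approach can be sketched as follows. The opcode names and trace representation below are purely illustrative, not the MOVE toolchain's actual format:

```python
# Illustrative sketch: greedily cover an operation trace with fused
# multiply-add ("mac") operations. Trace format is hypothetical.

def cover_multiply_add(trace):
    """Fuse a 'mul' whose result feeds the immediately following 'add'.

    Each trace entry is (opcode, dest, sources). Returns the covered
    trace and the number of fusions performed.
    """
    covered, fused = [], 0
    i = 0
    while i < len(trace):
        op, dst, srcs = trace[i]
        if op == "mul" and i + 1 < len(trace):
            nop, ndst, nsrcs = trace[i + 1]
            # Fuse only when the multiply result is consumed immediately,
            # and by an addition.
            if nop == "add" and dst in nsrcs:
                other = [s for s in nsrcs if s != dst]
                covered.append(("mac", ndst, srcs + other))
                fused += 1
                i += 2
                continue
        covered.append(trace[i])
        i += 1
    return covered, fused

trace = [
    ("mul", "t1", ["a", "b"]),
    ("add", "t2", ["t1", "c"]),   # fusable: t1 feeds the add
    ("mul", "t3", ["d", "e"]),
    ("sub", "t4", ["t3", "f"]),   # not an add: left alone
]
new_trace, n = cover_multiply_add(trace)
```

Counting the fusions found this way gives a quick upper bound on the transport reduction before any hardware is designed.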
Hardware modelling: power, area, cycle time
During design space exploration it is necessary to quickly determine
how a certain processor would perform in terms of speed (cycle time), area
and power consumption. The task is to model processors, in particular those
generated with our MOVE framework, accurately enough to make proper design
decisions. Separate models for silicon area, cycle time, and power consumption
have to be developed.
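As a minimal illustration of what such a model might look like, a first-order area estimate can be a weighted sum of resource counts. The coefficients below are placeholders, not calibrated figures for any real technology:

```python
# Hypothetical first-order silicon area model: a weighted sum of resource
# counts. Coefficients are illustrative relative area units, not real data.

AREA_COEFF = {
    "alu": 1.0,
    "multiplier": 8.0,
    "register": 0.05,
    "bus": 0.5,
}

def estimate_area(config):
    """Estimate silicon area from a {resource: count} configuration."""
    return sum(AREA_COEFF[r] * n for r, n in config.items())

small = estimate_area({"alu": 2, "multiplier": 1, "register": 64, "bus": 4})
```

A real model would of course need calibration against synthesized MOVE processors, and separate (likely non-linear) models for cycle time and power.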
Design space exploration
Due to the flexibility and scalability of the Move architecture, the
design space for processors is very large. If we want to determine the
optimal processor configuration for a given application, it is useful to
find out which resources are crucial to performance, and which ones can
be safely eliminated. This is done by scheduling the application on each
configuration that is chosen by an explorer program, after which processor
resource utilization data is available. This task concerns the following:
1) making exploration data, together with the scheduled code, graphically
visible; this would enable us to: - analyze the quality of the schedules at a
glance, - identify bottlenecks in both application code and processor configuration;
2) devising new design space exploration techniques and evaluating them.
Other important issues are: - speeding up the exploration by parallelizing
the code; - basing the exploration not only on performance and cost (silicon
area needed), but also on power consumption estimates.
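The core exploration loop can be sketched as follows. Here `schedule` is a stand-in for the real scheduler and returns an (illustrative) cycle count and area cost; the explorer then keeps only Pareto-optimal configurations:

```python
# Sketch of a design space explorer. The cost/performance model inside
# 'schedule' is a hypothetical placeholder, not the MOVE scheduler.

def schedule(config):
    # Placeholder model: more buses and ALUs reduce the cycle count,
    # with diminishing returns, at a cost in area.
    cycles = 1000 // (config["buses"] + config["alus"])
    area = 2.0 * config["alus"] + 0.5 * config["buses"]
    return cycles, area

def explore(configs):
    """Evaluate all configurations and keep the Pareto-optimal points
    (no other point is at least as fast AND at least as small)."""
    evaluated = [(*schedule(c), c) for c in configs]
    pareto = []
    for cyc, area, c in evaluated:
        dominated = any(c2 <= cyc and a2 <= area and (c2, a2) != (cyc, area)
                        for c2, a2, _ in evaluated)
        if not dominated:
            pareto.append(c)
    return pareto

configs = [{"buses": b, "alus": a} for b in (2, 4, 8) for a in (1, 2)]
pareto = explore(configs)
```

A real explorer would of course schedule the actual application rather than evaluate a closed-form model, which is exactly why visualization and speed matter.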
Source-to-Source transformations: a restructuring
tool for embedded programs
Parallelizing compilers take as input sequential code written in a
high-level language and automatically generate a parallel executable. The
code parallelization is obtained by applying a set of transformations to
the FOR loops of the program, and for example partitioning their iteration
space among the processors. The transformations enable the parallelization
and/or increase the scope in which the code can be parallelized. At the
TUD a tool for interactive restructuring of embedded programs written in
ANSI C has been implemented. This tool has to be extended to incorporate
all kinds of transformations, not only for making loops parallel, but also
for optimizing memory traffic and exploiting data locality.
Furthermore, it has to be researched when transformations should be
applied, and in which order. Therefore, a rule-based system that integrates
and controls this restructuring process has to be designed.
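One way such a rule-based controller could be organized is sketched below. The loop representation and the legality predicates are illustrative; a real tool would consult dependence analysis before applying a rule:

```python
# Hypothetical sketch of a rule-based controller for loop restructuring.
# Each rule pairs an applicability predicate with a transformation.

def interchange(loop):
    """Swap the two outermost loop indices (assumed legal by predicate)."""
    loop = dict(loop)
    loop["indices"] = ([loop["indices"][1], loop["indices"][0]]
                       + loop["indices"][2:])
    return loop

def tile(loop):
    """Mark the loop for tiling with an illustrative fixed tile size."""
    loop = dict(loop)
    loop["tiled"] = 32
    return loop

# Each rule: (name, applicability predicate, transformation).
RULES = [
    ("interchange", lambda l: l.get("interchange_legal", False), interchange),
    ("tile",        lambda l: len(l["indices"]) >= 2,            tile),
]

def restructure(loop):
    """Apply each applicable rule in order; record which rules fired."""
    applied = []
    for name, pred, xform in RULES:
        if pred(loop):
            loop = xform(loop)
            applied.append(name)
    return loop, applied

loop = {"indices": ["i", "j"], "interchange_legal": True}
out, applied = restructure(loop)
```

The research question is precisely what this sketch hard-codes: which rules to have, how to order them, and how to decide applicability automatically.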
Interprocedural data dependency analysis package
for embedded programs
Parallelizing compilers take as input sequential code written in a
high-level language and automatically generate a parallel executable. The
code parallelization is obtained by applying a set of transformations to
the FOR loops of the program, and for example partitioning their iteration
space among the processors.
To verify whether the transformations and the partitioning are legal, data
dependency analysis is applied. For typical numerical applications such
analyses perform very well. Unfortunately, for the "embedded" class of
programs this is often not the case. These programs are almost exclusively
written in the C language; intensive use of pointers, distribution of
calculations among a deep call hierarchy, and complex control flow and data
structures are common. A new data dependency analysis package that handles
these problems has to be implemented. This package should be built on top
of the existing "classic" package.
Efficient machine code generation using SUIF
When generating code for custom architectures like ASIPs (Application
Specific Instruction-set Processors), a number of optimizations are essential
for efficient code. Our new MOVE compiler, currently under development,
generates code for a wide range of ASIPs. In parallel to the compiler,
a code optimizer needs to be developed. A challenging aspect of the development
is that several optimizations, if not applied carefully, may worsen the
code for some ASIP configurations. The optimizer is to be built on top
of two libraries, which provide a convenient set of primitives for
manipulating the control flow graph and performing data flow analyses.
The student can thus focus on high level aspects concerning the algorithms,
rather than having to manage the internal details of the code and the graph
representations.
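To illustrate the kind of primitive such libraries provide, here is a sketch of a backward data flow analysis (liveness) iterated to a fixed point over basic blocks. The tiny CFG and use/def sets are illustrative:

```python
# Sketch of liveness analysis: iterate live-in/live-out sets over the
# control flow graph until nothing changes. The CFG below is hypothetical.

def liveness(succ, use, defs):
    """succ: block -> successor list; use/defs: block -> variable sets.
    Returns (live_in, live_out) dictionaries."""
    live_in = {b: set() for b in succ}
    live_out = {b: set() for b in succ}
    changed = True
    while changed:
        changed = False
        for b in succ:
            # live-out is the union of successors' live-in sets.
            out = set().union(*(live_in[s] for s in succ[b]))
            # live-in: used here, or live-out and not redefined here.
            inn = use[b] | (out - defs[b])
            if out != live_out[b] or inn != live_in[b]:
                live_out[b], live_in[b] = out, inn
                changed = True
    return live_in, live_out

# B0 defines x; B1 branches to B2 (uses x) and B3 (uses y).
succ = {"B0": ["B1"], "B1": ["B2", "B3"], "B2": [], "B3": []}
use  = {"B0": set(), "B1": {"x"}, "B2": {"x"}, "B3": {"y"}}
defs = {"B0": {"x"}, "B1": set(), "B2": set(), "B3": set()}
live_in, live_out = liveness(succ, use, defs)
```

With such analyses available as library calls, the optimizer itself can stay focused on the transformation logic.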
Design of a Sound Synthesis System for Digital
Organs
Together with a Dutch company, specialized in producing digital organs,
we are working on a new way of implementing digital organs. This is one
of the case studies within the MOVE project. The implementation concerns
many aspects like: coding the application, design and implementation of
a specific processor for sound synthesis (this includes the design of several
special function units), research into proper reverb and 3D-audio, porting
and using a real-time operating system, user interface, board design, testing,
etc. Some of these activities have been completed. Interested students
may contribute to this research, development and implementation.
Design of a processor for image enhancement
applications as used in digital copiers
This concerns another case study within the MOVE project, which is
performed in cooperation with a large printer and copier firm. The application
concerns several pixel based algorithms, like color conversion, sharpening,
and dithering. The application has been coded and a proposal for a specific
MOVE processor has already been made, including the design of several special
function units. Still, the final implementation and realization of the processor,
as well as board design and testing, have to be performed. There are plans
to extend this case study to mapping the complete software trajectory of a
digital copier.
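Dithering is one of the pixel algorithms named above. As a hedged illustration of the class of computation involved, here is the classic Floyd-Steinberg error-diffusion algorithm reducing 8-bit grey values to black/white; this is a generic textbook algorithm, not necessarily the copier's actual one:

```python
# Classic Floyd-Steinberg error diffusion: quantize each pixel and spread
# the quantization error to unvisited neighbours (7/16, 3/16, 5/16, 1/16).

def floyd_steinberg(img):
    """Dither a 2-D list of 0..255 grey values to 0/255 in place."""
    h, w = len(img), len(img[0])
    for y in range(h):
        for x in range(w):
            old = img[y][x]
            new = 255 if old >= 128 else 0
            img[y][x] = new
            err = old - new
            # Diffuse the quantization error to unvisited neighbours.
            if x + 1 < w:
                img[y][x + 1] += err * 7 // 16
            if y + 1 < h:
                if x > 0:
                    img[y + 1][x - 1] += err * 3 // 16
                img[y + 1][x] += err * 5 // 16
                if x + 1 < w:
                    img[y + 1][x + 1] += err * 1 // 16
    return img

out = floyd_steinberg([[100, 200], [150, 50]])
```

The strictly local, integer arithmetic per pixel is what makes such algorithms good candidates for special function units.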
Exploiting reconfigurable logic (gate arrays)
within TTAs and VLIWs
The MOVE Processor Generator synthesizes custom microprocessors from
an architecture description. These processors are based on the highly flexible
and scalable TTAs. At first sight, it may seem that an application-specific
processor design provides for better price/performance than a more generic
processor, and is therefore preferred when production is high enough to
allow for a custom device. However, there can be other reasons for a design
choice than device cost versus re-use. An important advantage of a less
specialized processor is that it allows for algorithm and system design
changes late in a system design process, when the processor cannot be changed
anymore. So, flexibility will still be desirable for many custom designs.
However, improving both price/performance and generality will usually present
conflicting design goals, and for each design a compromise must be found.
One interesting solution may be the inclusion of reconfigurable logic
within a TTA, such as Field Programmable Gate Arrays (FPGAs). Using this,
a TTA may in many cases provide the desired combination of application-specific
performance and generality. This research should answer questions like:
can we use FPGA logic, and for which applications? How can we use it (i.e.,
what pieces of code can be mapped onto it, and how)? What should be the basic
building blocks (cells) of these gate arrays?
Automatic mapping of loop-nests into a special
function unit
It may be advantageous to map whole loop-nests in programs onto special
function units. After mapping, their iterations will be executed much faster
on this dedicated hardware. If in addition we apply vectorization, the
speedup will be even higher. Unfortunately, direct mapping of most loop-nests
in a program will not lead to efficient implementations; this is due to
existing inter-iteration dependencies. A sequence of code transformations
is usually necessary.
The goal of this assignment is to develop a methodology for selecting good candidate loop-nests, and then transforming them into a form in which they can be directly mapped onto vector special function units implementing the functionality of the loop-nest body. Use of the existing code transformation tool is recommended. The structure of these special vector units has to be designed as well.
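A minimal candidate-selection criterion can be sketched as follows: a loop is a direct candidate for a vector unit when no iteration reads a value written by an earlier iteration. The access representation (array name plus offset relative to the loop index) is illustrative:

```python
# Hedged sketch of a loop-carried dependence check. Each access is
# (array, offset), meaning A[i + offset] in iteration i. A flow dependence
# is loop-carried when the writing offset exceeds the reading offset for
# the same array (element A[k] is written at iteration k - write_offset
# and read at iteration k - read_offset).

def has_carried_dependence(writes, reads):
    """True when some read uses a value a prior iteration wrote."""
    return any(wa == ra and wo > ro
               for wa, wo in writes for ra, ro in reads)

# A[i] = A[i-1] + c : carried dependence, not directly vectorizable.
dep = has_carried_dependence([("A", 0)], [("A", -1)])
# B[i] = A[i] + c   : iterations independent, directly vectorizable.
ok = not has_carried_dependence([("B", 0)], [("A", 0)])
```

Loops that fail this check are exactly the ones for which the transformation sequence (e.g. to break or reorder the dependencies) is needed before mapping.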
Genetic logic synthesis
Present day logic synthesis and technology mapping tools are based
on traditional logic optimization algorithms and heuristics. Such synthesis
algorithms are deterministic and they produce correct results in reasonable
time. However, tools based on these algorithms require a huge development
effort, and they have to be continuously adapted to changing process technology
and design methods. Even worse, there are problems that can bring traditional
algorithms to their knees. Examples of such problems are wide arithmetic
and parity (ex-or) logic, where the number of minterms generated
during analysis by traditional algorithms can become unmanageable. Another
such problem is mapping to cell libraries with special constraints, such
as domino-logic (dynamic) cells.
These problems do not have all that much impact on 'random logic' designs,
but their impact on the efficiency of more regular designs, such as synthesized
(MOVE) custom processors, is huge. However, it may be possible to overcome
the weak points of logic synthesis by using Genetic Algorithms (GAs). Logic
synthesis is probably a good target for GAs because there is a lot of structure
and regularity in logic design. A GA approach would not suffer from the
minterm explosion that plagues known deterministic algorithms, and could
achieve high-quality synthesis in problematic areas, even for 'non-standard'
operations (e.g. wide 3-input adders).
This research concerns the following tasks: survey the current problems,
investigate other GA approaches to logic synthesis, determine a GA approach,
determine proper benchmarks (case studies), and evaluate the approach.
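To make the GA idea concrete, here is a toy sketch that evolves a 2-input, 3-gate netlist toward an XOR truth table. The gate set, chromosome encoding, and GA parameters are all illustrative choices, not a proposed design:

```python
# Toy genetic algorithm for logic synthesis: a chromosome is a feed-forward
# netlist of 3 gates over inputs a, b; fitness is the number of XOR
# truth-table rows it matches. All parameters are illustrative.

import random

OPS = {
    "and":  lambda a, b: a & b,
    "or":   lambda a, b: a | b,
    "nand": lambda a, b: 1 - (a & b),
    "nor":  lambda a, b: 1 - (a | b),
}
OP_NAMES = list(OPS)
N_GATES = 3              # signals 0 and 1 are the primary inputs a, b

def evaluate(netlist, a, b):
    """Simulate the netlist; the last gate's output is the circuit output."""
    sig = [a, b]
    for op, i, j in netlist:
        sig.append(OPS[op](sig[i], sig[j]))
    return sig[-1]

def fitness(netlist):
    """Number of truth-table rows matching XOR (max 4)."""
    return sum(evaluate(netlist, a, b) == (a ^ b)
               for a in (0, 1) for b in (0, 1))

def random_gate(pos):
    # A gate at position pos may use the inputs or any earlier gate output.
    return (random.choice(OP_NAMES),
            random.randrange(2 + pos), random.randrange(2 + pos))

def evolve(pop_size=30, generations=60):
    random.seed(1)       # reproducible sketch
    pop = [[random_gate(g) for g in range(N_GATES)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        if fitness(pop[0]) == 4:
            break
        survivors = pop[: pop_size // 2]        # elitist selection
        children = []
        for parent in survivors:
            child = list(parent)
            g = random.randrange(N_GATES)       # point mutation
            child[g] = random_gate(g)
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve()
```

Whether `best` reaches a perfect fitness of 4 depends on the seed and parameters; e.g. the handcrafted netlist `[("or",0,1), ("nand",0,1), ("and",2,3)]` realizes XOR exactly. Note that no minterm enumeration appears anywhere, which is precisely the attraction of the approach.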
Research into and design of a small but flexible
communication processor within the MOVE framework
A recent extension to the MOVE framework concerns the automatic design
of single-chip multi-processor systems. The (TTA) processors of such a
system have to be tightly connected in order to achieve low interprocessor
latency and high throughput. A proper communication method has to be designed.
Based on this method the TTA processor generator has to be extended with
an extra function unit: the communication processor. In addition, the
communication method has to be modelled properly and incorporated into the
MOVE framework, in order to make correct design space exploration decisions
(i.e. using the right communication latencies, etc.).
Multi-threaded architectures
This research can be compared with multiscalar and other approaches
that exploit coarse-grain thread-level parallelism using a tightly coupled
multi-processor that allows inter-processor communication at the register
file level (instead of traditional multi-processors, which usually communicate
at the memory or I/O level).
Key elements of this research are: - architecture design of a multi-threaded
processor; this also involves the design of a parametric simulator; - partitioning
sequential programs into threads; - code generation; - performance analysis.
Cache modelling
During design space exploration thousands of different architectures
are examined automatically for a certain application. During this examination
the effect of instruction and data caches is not yet taken into account;
the reason is that simulating caches each time takes far too much time.
Therefore we propose to model these caches analytically.
This research should lead to such a model, and prove its effectiveness
during design space exploration (taking into account effects on cycle count,
cycle time, area and power consumption).
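The shape of such an analytical model can be sketched as follows: compulsory misses from the unique blocks touched, plus capacity misses estimated from how far the working set exceeds the cache. The formula is purely illustrative, not a validated model:

```python
# Hedged sketch of a simulation-free cache model. The capacity-miss
# estimate below is an illustrative formula, not calibrated against
# any real workload.

def estimate_misses(accesses, unique_blocks, cache_blocks):
    """Estimate misses for 'accesses' references touching 'unique_blocks'
    distinct blocks, on a cache holding 'cache_blocks' blocks."""
    compulsory = unique_blocks
    if unique_blocks <= cache_blocks:
        return compulsory            # everything fits after warm-up
    # Fraction of re-references assumed to fall outside the resident set.
    overflow = (unique_blocks - cache_blocks) / unique_blocks
    capacity = (accesses - unique_blocks) * overflow
    return compulsory + capacity

fits = estimate_misses(accesses=10_000, unique_blocks=256, cache_blocks=512)
thrash = estimate_misses(accesses=10_000, unique_blocks=2048, cache_blocks=512)
```

Even a crude model like this, evaluated in microseconds, lets the explorer rank thousands of cache configurations where full simulation of each would be infeasible; validating and refining the model is the actual research.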
Scheduling and code generation for instruction
level parallel processors
Scheduling code is one of the final, and probably the most important,
stages of the code generation trajectory. Scheduling for TTAs essentially
means exploiting as much instruction-level parallelism as possible, while
taking the TTA constraints into account and trying to exploit its specific
features (like operand sharing and software bypassing). Although our current
compiler is already quite advanced, we have observed many more opportunities
for improvement; to mention a few:
- exploiting guards (conditional operations) and if-conversion;
- exploiting multicasts (i.e. one-to-many transports);
- static stack allocation;
- hardware controlled loops;
- support for more complex function units;
and many more. Your research involves taking up one or more of these issues,
designing and implementing proper scheduling algorithms, and measuring and
evaluating their effect on a large set of benchmarks. You will have to
cooperate with several other researchers who also work on the compiler backend.
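The baseline on which such improvements build is list scheduling: each cycle, issue ready operations onto free resources, bounded by the issue width. The sketch below is a generic textbook version with an illustrative DAG, not the MOVE scheduler itself:

```python
# Generic list-scheduling sketch. ops: operation names; deps: op -> set of
# predecessors; latency: op -> cycles; width: max ops issued per cycle.

def list_schedule(ops, deps, latency, width):
    """Return op -> issue cycle, issuing ready ops greedily each cycle."""
    start, done_at = {}, {}
    cycle = 0
    remaining = set(ops)
    while remaining:
        # An op is ready when all predecessors have completed by this cycle.
        ready = [o for o in remaining
                 if all(p in done_at and done_at[p] <= cycle
                        for p in deps[o])]
        for o in ready[:width]:
            start[o] = cycle
            done_at[o] = cycle + latency[o]
            remaining.discard(o)
        cycle += 1
    return start

ops = ["load_a", "load_b", "mul", "add"]
deps = {"load_a": set(), "load_b": set(),
        "mul": {"load_a", "load_b"}, "add": {"mul"}}
latency = {"load_a": 2, "load_b": 2, "mul": 3, "add": 1}
sched = list_schedule(ops, deps, latency, width=2)
```

TTA-specific features such as software bypassing would shorten the chains this scheduler must respect, e.g. by forwarding the multiply result directly to the add without a register file round trip.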
Exploiting local memories
Processors for embedded systems may contain many local memories. E.g.
a 3-processor (single-chip) TTA designed for MPEG2 decoding contains about
30 of these memories, used for instructions, data, lookup tables, etc.
Until now the designer has had to decide manually whether and how to use
local memories, and thereafter rewrite the code to make use of these
memories. This is a time-consuming task, and it makes it impossible to
investigate many design alternatives quantitatively.
This research therefore aims at automating these steps. Based on profile
information and proper heuristics it should be decided which data/instructions/constants
can be kept locally. Code should be automatically transformed. The cost
and benefits of these memories have to be quantified and used during the
design space exploration.
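One plausible heuristic for the placement decision is a knapsack-style greedy choice by benefit density: profiled accesses saved per byte of local memory. The object names and numbers below are illustrative:

```python
# Hedged sketch of profile-driven local memory allocation: rank objects by
# accesses per byte and fill the local memory greedily. Data is illustrative.

def place_locally(objects, capacity):
    """objects: (name, size_bytes, profiled_accesses) tuples.
    Returns the names chosen for the local memory."""
    ranked = sorted(objects, key=lambda o: o[2] / o[1], reverse=True)
    chosen, used = [], 0
    for name, size, accesses in ranked:
        if used + size <= capacity:
            chosen.append(name)
            used += size
    return chosen

objects = [
    ("coeff_table", 512,  90_000),   # ~176 accesses/byte
    ("frame_buf",   4096, 200_000),  # ~49 accesses/byte, too big here
    ("scratch",     256,  10_000),   # ~39 accesses/byte
]
chosen = place_locally(objects, capacity=1024)
```

The selected set then drives both the automatic code transformation and the cost/benefit figures fed back into design space exploration.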
Exploiting Compiler feedback
Although many design steps can be automated, a designer may do a better
job if he knows exactly what is going on, i.e. when he knows why certain
design decisions are taken by the tools, and also why others are not.
This research concerns the topic of how to exploit feedback information
from the compiler, especially from the scheduling and code generation phase.
During this phase the compiler tries to exploit the available hardware
as much as possible. However, it is often hindered by dependencies, unavailable
resources, code that may not be executed speculatively, and many other
things. This research should investigate these issues: which information
should be made available to the designer, and which design actions can
be taken based on this information. This is a new and open research topic.
Generating testable hardware using BIST
At our laboratory we have a lot of experience with testing of, and
generating test patterns for, combinational and sequential circuits. Now
the time has come to apply this knowledge to TTA processors. This research
concentrates on how to test processors in general and TTAs in particular.
Its result should be a proposal for a BIST (built-in self test) method
for TTAs. The method has to be implemented within our processor generator
(written in VHDL), and evaluated with respect to required silicon area,
effect on cycle time, test time, and its applicability to other processor
designs.
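The pattern-generation half of a BIST scheme is typically a linear feedback shift register (LFSR). As a hedged illustration, here is a maximal-length 4-bit LFSR; the width and tap positions (for the primitive polynomial x^4 + x^3 + 1) are illustrative choices:

```python
# A 4-bit Fibonacci LFSR with taps at bits 3 and 2 (polynomial
# x^4 + x^3 + 1): it cycles through all 15 non-zero states, giving a
# cheap on-chip source of pseudo-random test patterns.

def lfsr_patterns(seed=0b0001, width=4, taps=(3, 2)):
    """Return the successive LFSR states until the sequence repeats."""
    state = seed
    patterns = []
    while True:
        patterns.append(state)
        bit = 0
        for t in taps:
            bit ^= (state >> t) & 1           # XOR the tapped bits
        state = ((state << 1) | bit) & ((1 << width) - 1)
        if state == seed:
            return patterns

pats = lfsr_patterns()
```

A complete BIST method would pair such a generator with a response compactor (e.g. a signature register) and evaluate, as stated above, the area, cycle-time, and test-time costs.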
Last modified on April 21st, 1998 by Andrea Cilio, email
A.Cilio@its.tudelft.nl