Student Tasks Proposals


This page lists the student task proposals (small assignments and master's thesis projects) which are currently available within the MOVE project. Detailed information is available from dr. Henk Corporaal.
 

Research activities

Within the MOVE project research is performed in the following areas:
  1. Computer architecture and engineering
  2. Application specific processing
  3. Parallel processing (ILP and coarse-grain parallelism exploitation)
  4. Heterogeneous (multi-processor) embedded systems
  5. Multi-threaded architectures
  6. Multi-media processors
  7. Low power processor design
  8. Hardware/software co-design
  9. High performance compilers: scheduling and code generation for ILP processors


If you are interested in one or more of these research areas and their application within the MOVE project, you are invited to perform your thesis research within our group.

A list of possible thesis assignments follows below. Note that this list is not exhaustive and may change over time; however, it gives you an indication of the type of research and development we are interested in. Usually a final assignment description in one of the above areas is made after investigating your interests and possibilities.
 

Exploiting complex operations
It is possible to implement patterns of RISC-like operations as a single new, more complex operation. A simple example of this is the multiply-add: a multiplication followed by an addition is implemented as a single operation. The advantages of such fusions are faster execution and a reduction of traffic across data buses. It is interesting to determine the potential speedup and transport reduction that a particular application gains from using complex function units. This means: 1) finding out which combinations of operations may be useful, and 2) using these complex operations. The latter can be done either through trace analysis (actually covering the trace of executed operations with these complex ones) or during scheduling.
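The trace-covering idea can be sketched as follows. This is only a toy Python illustration: the trace format, opcode names, and the greedy fusion rule are invented for the example, not taken from the MOVE tools.

```python
# Hypothetical sketch: greedily cover a linear operation trace with fused
# multiply-add ("mac") operations. A trace entry is (opcode, dest, srcs).
def cover_mac(trace):
    fused, i = [], 0
    while i < len(trace):
        op, dst, srcs = trace[i]
        nxt = trace[i + 1] if i + 1 < len(trace) else None
        # Fuse "mul; add" when the add consumes the product and no later
        # operation needs the intermediate result.
        if (op == "mul" and nxt and nxt[0] == "add" and dst in nxt[2]
                and not any(dst in s for _, _, s in trace[i + 2:])):
            other = [s for s in nxt[2] if s != dst]
            fused.append(("mac", nxt[1], srcs + other))
            i += 2
        else:
            fused.append(trace[i])
            i += 1
    return fused

trace = [("mul", "t1", ["a", "b"]),
         ("add", "t2", ["t1", "c"]),
         ("add", "t3", ["t2", "d"])]
print(cover_mac(trace))   # the first two operations fuse into one "mac"
```

A scheduler-based variant would apply the same pattern match on the data dependence graph instead of on a linear trace.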

Hardware modelling: power, area, cycle time
During design space exploration it is necessary to quickly determine how a certain processor would perform in terms of speed (cycle time), area and power consumption. The task is to model processors, in particular those generated with our MOVE framework, accurately enough to make proper design decisions. Separate models for silicon area, cycle time, and power consumption have to be developed.
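To make the modelling task concrete, here is a first-order area sub-model in Python. All structure and coefficients are illustrative placeholders (not calibrated MOVE data): it assumes total area is the sum of function-unit areas, a register file whose area grows quadratically with its port count (wiring), and bus interconnect proportional to buses times sockets.

```python
# Hypothetical first-order area model for a transport-triggered processor.
# a_cell, a_port, a_bus are invented unit-area coefficients.
def estimate_area(fu_areas, n_regs, n_ports, n_buses, n_sockets,
                  a_cell=1.0, a_port=0.5, a_bus=2.0):
    rf = n_regs * (a_cell + a_port * n_ports ** 2)   # register file
    interconnect = a_bus * n_buses * n_sockets        # transport network
    return sum(fu_areas) + rf + interconnect

# Two function units, 32 registers with 4 ports, 2 buses, 6 sockets:
print(estimate_area([10, 20], 32, 4, 2, 6))
```

Analogous closed-form sub-models would be needed for cycle time (e.g. dominated by the slowest bus or unit) and power (e.g. proportional to switched capacitance and transport activity).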

Design space exploration
Due to the flexibility and scalability of the MOVE architecture, the design space for processors is very large. If we want to determine the optimal processor configuration for a given application, it is useful to find out which resources are crucial to performance, and which ones can safely be eliminated. This is done by scheduling the application on each configuration chosen by an explorer program, after which processor resource utilization data is available. This task concerns the following: 1) making exploration data, together with the scheduled code, graphically visible, which would enable us to: - analyze the quality of the schedules at a glance, - identify bottlenecks in both application code and processor configuration; 2) devising new design space exploration techniques and evaluating them.
Other important issues are: - speeding up the exploration by parallelizing the code; - basing the exploration not only on performance and cost (silicon area needed), but also on power consumption estimates.
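The core of such an explorer is keeping only configurations that are not dominated in both cost and performance. A minimal Python sketch of that Pareto-front step, with invented example data:

```python
# Keep the cost/performance Pareto front of evaluated configurations.
# Each point is (cost, cycles, config); lower is better on both axes.
def pareto_front(points):
    front = []
    for cost, cycles, cfg in sorted(points):
        # After sorting by cost, a point survives only if it is faster
        # than every cheaper point kept so far.
        if not front or cycles < front[-1][1]:
            front.append((cost, cycles, cfg))
    return front

candidates = [(100, 50, "A"), (120, 40, "B"), (110, 55, "C"), (150, 39, "D")]
print(pareto_front(candidates))   # "C" is dominated by "A" and drops out
```

A real explorer would generate `candidates` by scheduling the application on each configuration; adding power as a third axis turns this into a three-dimensional dominance check.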

Source-to-Source transformations: a restructuring tool for embedded programs
Parallelizing compilers take as input sequential code written in a high-level language and automatically generate a parallel executable. The code parallelization is obtained by applying a set of transformations to the FOR loops of the program, for example by partitioning their iteration space among the processors. The transformations enable the parallelization and/or increase the scope in which the code can be parallelized. At the TUD a tool for interactive restructuring of embedded programs written in ANSI C has been implemented. This tool has to be extended to incorporate all kinds of transformations, not only for making loops parallel, but also for optimizing memory traffic and exploiting data locality.
Furthermore, it has to be researched when transformations should be applied, and in which order. To this end, a rule-based system has to be designed which integrates and controls the restructuring process.
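As a toy illustration of one such transformation, here is loop interchange on a symbolic loop-nest representation, together with the classic legality check on dependence direction vectors. The representation and names are invented for this sketch; the real tool works on ANSI C.

```python
# A 2-deep perfect nest as a list of (index_var, lower, upper) triples.
def interchange(nest):
    """Swap the two loops of a 2-deep perfect nest."""
    return [nest[1], nest[0]]

def interchange_legal(direction_vectors):
    # Interchange is illegal if any dependence has direction ('<', '>'):
    # swapping the loops would reverse that dependence.
    return all(dv != ('<', '>') for dv in direction_vectors)

nest = [("i", 0, "N"), ("j", 0, "M")]
if interchange_legal([('<', '<'), ('=', '<')]):
    nest = interchange(nest)          # j becomes the outer loop
print(nest)
```

A rule-based controller would chain such transformation/legality pairs, choosing the order that best improves locality or parallelism.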

Interprocedural data dependency analysis package for embedded programs
Parallelizing compilers take as input sequential code written in a high-level language and automatically generate a parallel executable. The code parallelization is obtained by applying a set of transformations to the FOR loops of the program, for example by partitioning their iteration space among the processors.
To verify whether the transformations and the partitioning are legal, data dependency analysis is applied. Existing analysis techniques perform very well for typical numerical applications. Unfortunately, for the ``embedded'' class of programs this is often not the case. These programs are almost exclusively written in the C language; intensive use of pointers, distribution of calculations among a deep call hierarchy, and complex control flow and data structures are common. A new data dependency analysis package that handles these problems has to be implemented. This package should be built on top of the existing "classic" package.
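One of the classic building blocks such a package starts from is the GCD test: a dependence between the accesses A[a*i + b] and A[c*j + d] can exist only if gcd(a, c) divides d - b. A small Python sketch (the test is conservative: True means a dependence is possible, not certain):

```python
from math import gcd

def gcd_test(a, b, c, d):
    """May the accesses A[a*i + b] and A[c*j + d] ever touch the same
    element? A dependence requires a*i + b == c*j + d to have an integer
    solution, which is possible only if gcd(a, c) divides d - b."""
    return (d - b) % gcd(a, c) == 0

print(gcd_test(2, 0, 2, 1))   # even vs. odd indices: provably independent
print(gcd_test(1, 0, 1, 3))   # dependence possible (distance 3)
```

The embedded-program difficulties listed above (pointers, deep call chains) are exactly what breaks the simple subscript form this test assumes, which is why interprocedural and pointer analysis are needed on top of it.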

Efficient machine code generation using SUIF
When generating code for custom architectures like ASIPs (Application Specific Instruction-set Processors), a number of optimizations are essential for efficient code. Our new MOVE compiler, currently under development, generates code for a wide range of ASIPs. In parallel with the compiler, a code optimizer needs to be developed. A challenging aspect of the development is that several optimizations, if not applied carefully, may worsen the code for some ASIP configurations. The optimizer is to be built on top of two libraries, which provide a convenient set of primitives for manipulating the control flow graph and performing data flow analyses. The student can thus focus on high-level aspects concerning the algorithms, rather than having to manage the internal details of the code and graph representations.

Design of a Sound Synthesis System for Digital Organs
Together with a Dutch company specialized in producing digital organs, we are working on a new way of implementing digital organs. This is one of the case studies within the MOVE project. The implementation concerns many aspects, such as: coding the application, design and implementation of a specific processor for sound synthesis (this includes the design of several special function units), research into proper reverb and 3D audio, porting and using a real-time operating system, user interface design, board design, testing, etc. Some of these activities have already been completed. Interested students may contribute to this research, development and implementation.

Design of a processor for image enhancement applications as used in digital copiers
This concerns another case study within the MOVE project, performed in cooperation with a large printer and copier firm. The application concerns several pixel-based algorithms, such as color conversion, sharpening, and dithering. The application has been coded, and a proposal for a specific MOVE processor has already been made, including the design of several special function units. However, the final implementation and realization of the processor, as well as board design and testing, still have to be performed. Plans are to extend this case study to the mapping of the complete software trajectory of a digital copier.

Exploiting reconfigurable logic (gate arrays) within TTAs and VLIWs
The MOVE Processor Generator synthesizes custom microprocessors from an architecture description. These processors are based on the highly flexible and scalable TTAs. At first sight, it may seem that an application-specific processor design provides better price/performance than a more generic processor, and is therefore preferred whenever production volume is high enough to allow for a custom device. However, there can be other reasons for a design choice than device cost versus re-use. An important advantage of a less specialized processor is that it allows for algorithm and system design changes late in the system design process, when the processor can no longer be changed. So flexibility will still be desirable for many custom designs. However, improving both price/performance and generality usually presents conflicting design goals, and for each design a compromise must be found.
One interesting solution may be the inclusion of reconfigurable logic, such as Field Programmable Gate Arrays (FPGAs), within a TTA. With reconfigurable logic, a TTA may in many cases provide the desired combination of application-specific performance and generality. This research should answer questions like: can we use FPGA logic, and for which applications? How can we use it (i.e., what pieces of code can be mapped onto it, and how)? What should be the basic building blocks (cells) of these gate arrays?

Automatic mapping of loop-nests into a special function unit
It may be advantageous to map whole loop-nests in programs onto special function units. After mapping, their iterations will be executed much faster on this dedicated hardware. If, in addition, we apply vectorization, the speedup will be even higher. Unfortunately, direct mapping of most loop-nests in a program will not lead to efficient implementations, due to existing inter-iteration dependencies. A sequence of code transformations is usually necessary.

The goal of this assignment is to develop a methodology for selecting good candidate loop-nests and transforming them into a form in which they can be directly mapped onto vector special function units implementing the functionality of the loop-nest body. Use of the existing code transformation tool is recommended. The structure of these special vector units has to be designed as well.

Genetic logic synthesis
Present day logic synthesis and technology mapping tools are based on traditional logic optimization algorithms and heuristics. Such synthesis algorithms are deterministic and they produce correct results in reasonable time. However, tools based on these algorithms require a huge development effort, and they have to be continuously adapted to changing process technology and design methods. Even worse, there are problems that can bring traditional algorithms to their knees. Examples of such problems are wide arithmetic and parity (ex-or) logic, where the number of minterms generated during analysis by traditional algorithms can become unmanageable. Another such problem is mapping to cell libraries with special constraints, such as domino-logic (dynamic) cells.
These problems do not have much impact on 'random logic' designs, but their impact on the efficiency of more regular designs, such as synthesized (MOVE) custom processors, is huge. However, it may be possible to overcome these weak points of logic synthesis by using Genetic Algorithms (GAs). Logic synthesis is probably a good target for GAs because there is a lot of structure and regularity in logic design. A GA approach would not suffer from the minterm explosion that plagues the known deterministic algorithms, and could achieve high-quality synthesis in problematic areas, even for 'non-standard' operations (e.g. wide 3-input adders).
This research involves steps like: investigating the current problems, surveying other GA approaches to logic synthesis, determining a GA approach, determining proper benchmarks (case studies), and evaluating the approach.
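To make the GA idea concrete, here is a deliberately tiny Python sketch that evolves a two-cube sum-of-products cover for the 2-input ex-or function, one of the problem cases named above. Everything here (encoding, fitness, operators, parameters) is an invented toy, far simpler than what real logic synthesis would need.

```python
import random
random.seed(1)

N = 2                                                   # number of inputs
TARGET = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}   # XOR truth table

def eval_cover(cubes, x):
    # OR of AND-cubes; a literal value of 2 means "don't care".
    return int(any(all(c in (2, v) for c, v in zip(cube, x)) for cube in cubes))

def fitness(cubes):
    # Number of truth-table rows the cover reproduces correctly.
    return sum(eval_cover(cubes, x) == y for x, y in TARGET.items())

def mutate(cubes):
    # Flip one literal of one cube to a random value.
    cubes = [list(c) for c in cubes]
    cube = random.choice(cubes)
    cube[random.randrange(N)] = random.choice((0, 1, 2))
    return [tuple(c) for c in cubes]

# Elitist evolution of a population of two-cube covers.
pop = [[tuple(random.choice((0, 1, 2)) for _ in range(N)) for _ in range(2)]
       for _ in range(20)]
for _ in range(200):
    pop.sort(key=fitness, reverse=True)
    if fitness(pop[0]) == len(TARGET):
        break
    pop = pop[:10] + [mutate(random.choice(pop[:10])) for _ in range(10)]
pop.sort(key=fitness, reverse=True)
print(pop[0], fitness(pop[0]))
```

Note that the GA never enumerates minterms: fitness is evaluated by simulation, which is the property that makes the approach attractive for wide arithmetic and parity logic.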

Research into and design of a small but flexible communication processor within the MOVE framework
A recent extension to the MOVE framework concerns the automatic design of single-chip multi-processor systems. The (TTA) processors of such a system have to be tightly connected in order to achieve low inter-processor latency and high throughput. A proper communication method has to be designed. Based on this method, the TTA processor generator has to be extended with an extra function unit: the communication processor. In addition, the communication method has to be modelled properly and incorporated within the MOVE framework, in order to make correct design space exploration decisions (i.e. using the right communication latencies, etc.).

Multi-threaded architectures
This research can be compared with the multiscalar and other approaches that exploit coarse-grain, thread-level parallelism using a tightly coupled multi-processor which allows inter-processor communication at the register file level (instead of traditional multi-processors, which usually communicate at the memory or I/O level).
Key elements of this research are: - architecture design of a multi-threaded processor; this also involves the design of a parametric simulator; - partitioning sequential programs into threads; - code generation; - performance analysis.

Cache modelling
During design space exploration, thousands of different architectures are examined automatically for a certain application. During this examination, the effect of instruction and data caches is not yet taken into account; the reason is that simulating the caches for every configuration takes far too much time. Therefore we propose to model these caches analytically.
This research should lead to such a model, and prove its effectiveness during design space exploration (taking into account effects on cycle count, cycle time, area and power consumption).
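As a starting point, one well-known analytical form is the power-law ("square-root") miss-rate model, miss_rate(C) = m0 * (C / C0)^(-alpha) with alpha around 0.5, which can then feed a cycle-count estimate. The Python sketch below uses invented coefficients purely for illustration; the actual model would have to be fitted to measured miss rates.

```python
# Power-law cache miss-rate model; m0 is the miss rate at reference size
# c0 (bytes). All coefficient values here are illustrative placeholders.
def miss_rate(cache_bytes, m0=0.05, c0=8192, alpha=0.5):
    return m0 * (cache_bytes / c0) ** -alpha

def cycles(n_accesses, cache_bytes, hit_cost=1, miss_penalty=20):
    # Expected memory cycles: every access pays the hit cost, and a
    # fraction miss_rate additionally pays the miss penalty.
    return n_accesses * (hit_cost + miss_rate(cache_bytes) * miss_penalty)

# With alpha = 0.5, doubling the cache divides the miss rate by sqrt(2):
print(miss_rate(8192), miss_rate(16384))
print(cycles(1000, 8192))
```

During exploration, such closed-form models replace per-configuration cache simulation; their error against simulation on a benchmark set is exactly what the research has to quantify.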

Scheduling and code generation for instruction level parallel processors
Scheduling code is one of the final, and probably the most important, stages of the code generation trajectory. Scheduling for TTAs essentially means exploiting as much instruction-level parallelism as possible, while taking the TTA constraints into account and trying to exploit its specific features (like operand sharing and software bypassing). Although our current compiler is already quite advanced, we have observed many more opportunities for improvement, to mention a few:
  - Exploiting guards (conditional operations) and if-conversion
  - Exploiting multicasts (i.e. one-to-many transports)
  - Static stack allocation
  - Hardware-controlled loops
  - Support for more complex function units
and many more. Your research will involve one or more of these issues: designing and implementing proper scheduling algorithms, and measuring and evaluating their effect on a large set of benchmarks. You will have to cooperate with several other researchers who also work on the compiler backend.
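The common core underneath most of these scheduling improvements is list scheduling: each cycle, issue the ready operations (all predecessors finished, an issue slot free) in priority order. A minimal Python sketch, with an invented example DAG and latencies (a real TTA scheduler would additionally model buses, sockets, and bypasses):

```python
# dag: {op: set of predecessor ops}; latency: {op: cycles}.
# Returns {op: issue_cycle}.
def list_schedule(dag, latency, n_units):
    start, done, cycle = {}, {}, 0
    while len(start) < len(dag):
        busy = 0
        for op in sorted(dag):                  # priority: name order here
            if op in start or busy >= n_units:
                continue
            # Ready when every predecessor's result is available.
            if all(p in done and done[p] <= cycle for p in dag[op]):
                start[op] = cycle
                done[op] = cycle + latency[op]
                busy += 1
        cycle += 1
    return start

dag = {"a": set(), "b": set(), "c": {"a", "b"}, "d": {"c"}}
lat = {"a": 1, "b": 1, "c": 2, "d": 1}
print(list_schedule(dag, lat, n_units=2))
```

Guards, multicasts, and software bypassing all enter this loop as changes to the readiness test and to what "issuing" an operation consumes.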

Exploiting local memories
Processors for embedded systems may contain many local memories. For example, a 3-processor (single-chip) TTA designed for MPEG-2 decoding contains about 30 of these memories, used for instructions, data, lookup tables, etc. Until now, the designer has had to decide manually whether and how to use local memories, and thereafter rewrite the code to make use of them. This is a time-consuming task, and it makes it impossible to investigate many design alternatives quantitatively.
This research therefore aims at automating these steps. Based on profile information and proper heuristics, it should be decided which data, instructions, and constants can be kept locally. The code should be transformed automatically. The costs and benefits of these memories have to be quantified and used during design space exploration.
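A simple heuristic for the decision step is a greedy knapsack: fill each local memory with the objects that save the most off-chip accesses per byte. The Python sketch below illustrates the idea; the object names and profile counts are invented example data.

```python
# objects: list of (name, size_bytes, profiled_access_count).
# Greedily place objects with the highest accesses-per-byte density
# until the local memory's capacity is exhausted.
def allocate_local(objects, capacity):
    chosen, used = [], 0
    for name, size, hits in sorted(objects, key=lambda o: o[2] / o[1],
                                   reverse=True):
        if used + size <= capacity:
            chosen.append(name)
            used += size
    return chosen

objs = [("lut", 1024, 50000), ("frame", 8192, 60000), ("coef", 256, 30000)]
print(allocate_local(objs, capacity=2048))   # small, hot objects win
```

The code transformation step would then redirect the accesses of the chosen objects to local-memory addresses, and the resulting cycle/area trade-off feeds back into the design space exploration.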

Exploiting Compiler feedback
Although many design steps can be automated, a designer may do a better job if he knows exactly what is going on, i.e. when he knows why certain design decisions are taken by the tools, and also why others were not. This research concerns how to exploit feedback information from the compiler, especially from the scheduling and code generation phase. During this phase the compiler tries to exploit the available hardware as much as possible; however, it is often hindered by dependencies, unavailable resources, code which may not be executed speculatively, and many other things. This research should investigate these issues: which information should be made available to the designer, and which design actions can be taken based on this information. This is a new and open research topic.

Generating testable hardware using BIST
At our laboratory we have a lot of experience with testing, and generating test patterns for, combinational and sequential circuits. The time has now come to apply this knowledge to TTA processors. This research concentrates on how to test processors in general and TTAs in particular. Its result should be a proposal for a BIST (built-in self-test) method for TTAs. The method has to be implemented within our processor generator (written in VHDL) and evaluated with respect to required silicon area, effect on cycle time, test time, and its applicability to other processor designs.



Last modified on April 21st, 1998 by Andrea Cilio, email A.Cilio@its.tudelft.nl