New chips bring new programming

National Instruments Aust Pty Ltd
By Jeff Meisel*
Tuesday, 14 April, 2009


For years, software performance improved 'for free' as chip vendors steadily raised processor clock speeds. Today, however, increasing clock speeds for performance gains is not viable because of power consumption and heat dissipation constraints.

Chip vendors have instead moved to entirely new chip architectures with multiple processor cores on a single chip. With multicore processors, programmers can complete more total work than with one core alone. However, to take advantage of multicore processors, programmers must reconsider how they develop applications.

In the words of Herb Sutter, Microsoft software architect, the “free lunch is over” for developers who expect to see immediate software application performance gains when end users simply upgrade their computers to ones with faster processors. In short, programmers now have to work for continued performance improvements.

Sequential programs saw performance improvements as a result of processor clock speed increases; upgrading to a computer with a faster CPU meant that each instruction in a series would run faster.

To continue seeing performance gains with multicore systems, developers need to design their applications to divide the work among cores — in essence, develop a parallel application instead of a sequential one.

Multitasking is the ability to do multiple things at the same time. This is of paramount importance today, as computer users want to surf the web, check email, listen to iTunes and run the development and business tools they use at work, all at the same time.

To understand multitasking, consider the analogy of a mechanic who operates a small car repair shop.

How does the mechanic decide which car to service first if several customers are waiting?

It comes down to a scheduling issue. Operating systems face this same dilemma of deciding which task is most important to execute, how fast the task needs to be completed and, in general, how best to use the available system resources.

Some operating systems use a basic scheduling concept such as 'round robin', which defines time slices for each task and ensures every task is treated equally.

A more sophisticated scheduling concept found in commercial OSs is 'preemptive multitasking' — where tasks share a CPU but one can obtain a higher priority and jump to the front of the execution line.

If an ambulance needs service, the mechanic most likely will force the other customers to wait longer and repair the ambulance so it can answer emergency calls.

While multitasking appears to be the answer to completing several actions at once, in a preemptive multitasking scheme only the highest-priority task is actually guaranteed to execute.

Now consider the car lift, which does all the heavy lifting. It enables work but, at the same time, it can become a bottleneck. Once enough customers are waiting at the shop at the same time, the mechanic has to turn away work because there is no way to repair all the cars.

One way to repair more cars faster is to hire more employees. This increases repair completion speed because more help is available to work on a problem at the same time. Another answer would be to add more car lifts, so mechanics could repair more cars simultaneously.

In today’s computing landscape, adding more computing engines to do the ‘heavy lifting’ is the only way to increase overall work throughput. This was not so much a choice as it was a necessity for silicon manufacturers, who could not increase CPU clock speeds anymore.

(Clock speeds are expected to continue climbing slowly before hitting a permanent barrier somewhere around the 10 GHz range.)

Instead of fighting the inevitable, chip vendors are producing multicore chips (multiple processors on a single chip). Operating systems support multicore processors through symmetric multiprocessing (SMP), which means that an OS may schedule threads that are ready to run on any available CPU.

Adding a second core to a computer is analogous to the mechanic who adds a second car lift; in other words, the overall potential for work throughput is doubled.

This means consumers can achieve true multitasking, because the OS can divide the running applications across the available cores, as shown in Figure 1.

While the OS can handle multitasking for all the applications running on a system, software developers may want to create applications that are broken into unique tasks to use the available power in a multicore system. They can accomplish this by breaking an individual application into threads.

In multithreaded programming, the fundamental unit is the process: the data, code and other information required by an operating system to execute instructions. A process is made up of one or more 'threads', and the threads represent unique instruction sequences that are independent of one another.

Threads share the same state of the process in which they reside and can access the same functions and data. In other words, the threads residing in the same process see the same things. Applications that are not multithreaded are executed in the main thread of an application.
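As a minimal illustrative sketch (in C++ with the standard std::thread library; the names are hypothetical, as the examples in this article are conceptual), the fragment below spawns two threads within one process, both of which see and modify the same variable:

```cpp
#include <iostream>
#include <thread>

int shared_counter = 0;  // process-level data: visible to every thread

void worker(int increments) {
    for (int i = 0; i < increments; ++i)
        ++shared_counter;  // unsynchronised access -- addressed below
}

int main() {
    // Two threads inside the same process; both see shared_counter.
    std::thread t1(worker, 100000);
    std::thread t2(worker, 100000);
    t1.join();
    t2.join();
    // Without synchronisation the printed total may be less than 200000.
    std::cout << shared_counter << '\n';
    return 0;
}
```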

One question that arises when moving to a multithreaded architecture is: How much code overhead is all this threading going to take up in an application?

In terms of lines of code, the overhead is minimal: 99% of an application will be the same as it was before, with the other 1% spent on thread management and synchronisation.

Keep in mind, however, that the 1% of code may take a large percentage of the development time.

The real work required in developing a multithreaded application is not so much creating the threads, but instead managing the states of the threads and ensuring proper communication and synchronisation between the threads.

When the application is running, the threads can be in one of four states: running, ready to run, blocked or terminated. These states must be managed effectively so that the application produces correct results.

Because threads have access to the same memory, engineers must take care with resources that may be needed by more than one part of the application at once. If a resource can be needed simultaneously, a thread must 'block' other threads from using it inadvertently, by employing constructs called locks and semaphores.

An application that takes this into account is called 'thread safe'. When multiple threads become hard to follow, common programming pitfalls can arise, such as race conditions, deadlocks and the inefficiency of spawning too many threads.

Synchronisation plays an important role in defining the execution order among threads in an application. The most common method of synchronisation is called ‘mutual exclusion’.

This occurs when one thread blocks a critical section of code to prevent other threads from accessing that section until it is their turn.

To synchronise threads, primitives based on the lowest level of code implementation (atomic operations) are used. Examples of synchronisation primitives are semaphores, locks and condition variables.
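Revisiting the earlier counter sketch, mutual exclusion can be expressed with a lock (here a C++ std::mutex; the names remain hypothetical). Only one thread at a time may enter the critical section, so the shared counter is always updated correctly:

```cpp
#include <mutex>
#include <thread>

std::mutex counter_mutex;   // lock guarding the shared counter
int shared_counter = 0;

void worker(int increments) {
    for (int i = 0; i < increments; ++i) {
        // Mutual exclusion: only one thread at a time may hold the
        // lock, so the increment below is never interleaved.
        std::lock_guard<std::mutex> guard(counter_mutex);
        ++shared_counter;
    }
}

int main() {
    std::thread t1(worker, 100000);
    std::thread t2(worker, 100000);
    t1.join();
    t2.join();
    return shared_counter == 200000 ? 0 : 1;  // now always 200000
}
```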

Consider an example of four threads running in parallel. Assume a master thread must wait until all threads are complete before proceeding.

In the first use case, the threads are not properly synchronised and the master thread inadvertently proceeds with an incorrect value (due to a race condition).

In the second case, the threads are properly synchronised such that the master thread waits until all other threads have completed, then executes with the proper value.
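A sketch of the second, correctly synchronised case might look like the following, with std::thread::join serving as the synchronisation point (the worker bodies are placeholders for real computation):

```cpp
#include <thread>
#include <vector>

int partial[4] = {0, 0, 0, 0};

void compute(int id) { partial[id] = (id + 1) * 10; }  // stand-in work

int main() {
    std::vector<std::thread> workers;
    for (int id = 0; id < 4; ++id)
        workers.emplace_back(compute, id);

    // Synchronisation point: the master thread blocks here until
    // every worker has finished, so no partial result is read early.
    for (std::thread& w : workers) w.join();

    int total = 0;
    for (int r : partial) total += r;   // 10 + 20 + 30 + 40
    return total == 100 ? 0 : 1;
}
```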

When combined with multithreaded programming, optimisation techniques can deliver tremendous performance improvements on multicore systems. Three example strategies include task parallelism, data parallelism and pipelining.

Task parallelism is one of the most intuitive ways to break up an application. The application is divided into its component functions, or tasks, and written in a multithreaded fashion based on that separation.
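As a hedged sketch of task parallelism, each independent task of a hypothetical application gets its own thread, leaving the OS scheduler free to run them on different cores:

```cpp
#include <thread>

// Two independent tasks; the bodies are placeholders for real work.
void acquire_data() { /* task 1: read samples from an instrument */ }
void analyse_data() { /* task 2: process previously read samples  */ }

int main() {
    // One thread per task: on a multicore machine the OS scheduler
    // can run each task on a separate core.
    std::thread t1(acquire_data);
    std::thread t2(analyse_data);
    t1.join();
    t2.join();
    return 0;
}
```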

Data parallelism is a programming technique for splitting a large data set into smaller chunks that can be operated on in parallel. After the data has been processed, it is combined back into a single data set.

With this technique, programmers can modify a process that typically would not be capable of using multicore processing power, so that it can efficiently use all processing power available.
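The sketch below illustrates data parallelism under the same assumptions as the earlier examples: a large array is split into one slice per thread, each slice is summed independently, and the partial results are combined at the end:

```cpp
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    const std::size_t chunks = 4;              // e.g. one chunk per core
    std::vector<double> data(1000000, 1.0);    // the large data set
    std::vector<double> partial(chunks, 0.0);
    std::vector<std::thread> workers;

    const std::size_t n = data.size() / chunks;
    for (std::size_t c = 0; c < chunks; ++c)
        workers.emplace_back([&, c] {
            // Each thread operates on its own slice of the data.
            partial[c] = std::accumulate(data.begin() + c * n,
                                         data.begin() + (c + 1) * n, 0.0);
        });
    for (std::thread& w : workers) w.join();

    // Combine the chunks back into a single result.
    double total = std::accumulate(partial.begin(), partial.end(), 0.0);
    return total == 1000000.0 ? 0 : 1;
}
```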

A technique for improving the performance of serial software tasks is pipelining. This divides a serial task into concrete stages that can be executed in assembly-line fashion.
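A minimal two-stage pipeline sketch follows, again with hypothetical stage functions: stage 1 pushes results into a queue protected by a lock and condition variable, while stage 2 drains the queue concurrently, so successive items overlap in assembly-line fashion:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

std::queue<int> between_stages;   // hands results from stage 1 to stage 2
std::mutex m;
std::condition_variable ready;
bool stage1_done = false;

int stage1(int x) { return x * 2; }   // placeholder first stage
int stage2(int x) { return x + 1; }   // placeholder second stage

int main() {
    // Stage 1: produce items and pass them down the line.
    std::thread first([] {
        for (int i = 0; i < 100; ++i) {
            int out = stage1(i);
            std::lock_guard<std::mutex> lock(m);
            between_stages.push(out);
            ready.notify_one();
        }
        std::lock_guard<std::mutex> lock(m);
        stage1_done = true;
        ready.notify_one();
    });

    // Stage 2 runs concurrently: while stage 1 works on item i + 1,
    // stage 2 processes item i, assembly-line fashion.
    std::thread second([] {
        for (;;) {
            std::unique_lock<std::mutex> lock(m);
            ready.wait(lock,
                       [] { return !between_stages.empty() || stage1_done; });
            if (between_stages.empty()) break;   // pipeline drained
            int item = between_stages.front();
            between_stages.pop();
            lock.unlock();
            stage2(item);
        }
    });

    first.join();
    second.join();
    return 0;
}
```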

Libraries and tools can assist with multithreading by simplifying some of the low-level details. Advantages include greater productivity in the current development cycle and, in some cases, more scalable code down the road.

Threading libraries present a way for developers to ease their transition to parallel code. An example is a programming model called OpenMP, which uses compiler directives to mark sections of serial code that should run in parallel.

First released as an API specification in 1997, OpenMP is a platform-independent set of compiler directives that supports Fortran, C and C++. One key benefit of OpenMP is the ability to parallelise for loops.
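For example, a loop over an array can be parallelised with a single directive (a minimal sketch; build with an OpenMP-capable compiler, e.g. g++ -fopenmp):

```cpp
#include <vector>

int main() {
    const int n = 1000000;
    std::vector<double> a(n, 1.0), b(n, 0.0);

    // The directive below asks the compiler to split the loop
    // iterations across the available cores; without OpenMP enabled
    // the pragma is ignored and the loop simply runs serially.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        b[i] = 2.0 * a[i];

    return b[0] == 2.0 ? 0 : 1;
}
```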

Graphical programming tools can increase developer productivity by handling many of the multithreading intricacies under the bonnet. The fundamental benefit of graphical programming is that programmers can ‘see’ the parallelism in the code.

An example is the NI LabVIEW graphical programming language, which was originally developed in 1986 (native multithreading was added in 1998). LabVIEW is a fully compiled, graphical programming language based on structured data flow.

Data flow is represented between function blocks via connections in the block diagram. These connections also play an important role in multithreaded programming by serving as data dependencies and synchronisation elements.

Figure 3 shows a block diagram (source code) for a LabVIEW application that has three main elements: data processing, user interface and networking.

Similar to how OpenMP uses an optimal number of threads to execute parallel code, LabVIEW also offers scalability of threads depending on the underlying hardware.

For example, on a higher-end n-core system, the LabVIEW compiler creates more system threads to take better advantage of all the available cores in the system.

By moving to a higher-level abstraction, developers may feel like they are losing the low-level threading control. This is a valid concern, so it is important that graphical tools provide a means for explicit threading in addition to any abstraction that is provided. In LabVIEW, for example, specific structures are provided that spawn unique threads. In addition, those threads can be assigned to a specific processor core using a processor affinity assignment and the thread pool itself can be managed at the application level.

Debugging plays a critical role in any development effort, but it deserves special mention with regard to multithreaded programming and the challenges that can arise.

In particular, the new multicore hardware architectures use a shared memory architecture, which can present some interesting cache contention issues that were not a problem previously.

Debugging tools can assist developers in providing insight into thread activity. This can be very useful on multicore systems to distinguish thread activity running on the different cores.

As hardware architectures move to multicore processors, multithreaded programming is becoming an even more critical technique for software developers to master.

After mastering the basics, programming strategies, including task parallelism, data parallelism and pipelining, may be implemented to further optimise performance.

Jeff Meisel is the real time and embedded product manager for National Instruments.
