In this opening chapter of Part 2 of this book, you will learn about the SoC architecture definition phase, which precedes the design and implementation phase of the required SoC. This phase is performed by the system architects, who translate a set of product requirements into a high-level description of the SoC to be designed. We will also detail the criteria used during the functional decomposition stage, in which a trade-off is reached between what is better suited to a hardware implementation and what is a better target for a software implementation. Finally, we will provide an overview of SoC system modeling, which can use many available tools and environments.
In this chapter, we’re going to cover the following main topics:
This is the beginning of the purely technical stage in a project aiming to design an SoC. Usually, the technology to use isn't specified at this stage, but there could be clear business reasons, as covered in Chapter 1, Introducing FPGA Devices and SoCs, that make the FPGA the primary target technology for the SoC to design. These reasons can include (but are not limited to) the following:
There could be many other reasons for making the FPGA the best target for the SoC to design, which then benefits the time to market and provides the flexibility such a choice offers. At this stage of the thinking process, the architecture is also centered around the Zynq-7000 SoC or the Zynq UltraScale+ SoC, starting from their processor subsystem (PS) block. We still need to perform our feasibility study and confirm that the project objectives and the FPGA SoC capabilities are in line. Each FPGA SoC device family has a known set of features, capabilities, and associated device costs. An initial idea of the SoC capabilities to include and the major intellectual properties (IPs) to design is put together based on the input from the marketing team, which conducts many interviews with the key target customers of the product under design and gathers the key product requirements. It is at this moment that broad guidelines from the business team are needed to define the cost of the overall solution and product the company is making. However, the business decision could be delayed slightly until the SoC cost is defined, the performance requirements are refined, and the time to design it is estimated. It is also at this stage that the overall product cost could be revisited to figure out alternative business strategies if further adaptations are required.
The overall product system architecture definition and the business strategy are outside the scope of this book, so we assume these have already been set by the time the SoC architecture definition is considered. The remaining tasks are to decide which SoC can provide the required interfaces and deliver the performance the product needs. This SoC architecture exploration phase will look at all the possible alternatives and the associated design costs in terms of the SoC hardware and software. Then, it will provide a report to the business decision-makers to approve the optimal choice for the project that meets the company strategy. There could be other ways and methods by which this decision is made, according to the company's culture, but this decision process is not relevant to this book.
Additionally, there is also a need to define how much custom work will be needed to complement the SoC PS block's functionalities, features, and performance. A detailed report on this is captured in the architecture specification; in the following section, we will cover the SoC hardware and software partitioning stage. It's expected that IPs will be designed for this project, implemented in the programmable logic (PL) side of the FPGA, and integrated with the PS block of the SoC. This adds another criterion for choosing between the SoCs within the same family or among the available families.
As introduced in Chapter 1, Introducing FPGA Devices and SoCs, some SoCs specifically target some applications and industries in terms of their capabilities and features, as well as the availability of certain packages and certifications in their portfolio. This choice should also make sure that enough input/output (I/O) is available in the SoC package to use, and that the SoC can physically interface with the neighboring integrated circuits (ICs) in terms of supported PHYs and inter-device communication protocols.
These interfaces include the ones covered in Chapter 4, Connecting High-Speed Devices Using Busses and Interconnects, and Chapter 5, Basic and Advanced SoC Interfaces, as well as those briefly introduced in Chapter 1 of this book, such as the following:
The list should also state which version of the protocol standard these interfaces support, as well as what kind of backward compatibility is available if the standard revisions and generations differ between the SoC's available interfaces and the neighboring ICs on the product's electronics board. The physical characteristics of the FPGA device in terms of temperature range, package size, mechanical properties, and all the certifications required for the industry vertical targeted by the product need to be considered before we start any architecture design work. For these, the list of device requirements should have been established by the project team at the technology target selection phase, before the SoC architecture design phase; this selection process is also outside the scope of this book. To give you an easier and clearer way to decide between the different Xilinx SoCs covered by this book, four tables (Tables 6.1 to 6.4) are provided in this chapter summarizing their available features.
There are clear differences in the processing capabilities of the Zynq-7000 SoC and the Zynq UltraScale+ SoC PS blocks. There is no Cortex-R processor cluster in the Zynq-7000 SoC FPGAs, but applications that require some form of real-time profile and a deterministic processor can build one in the PL side of the FPGA using the MicroBlaze processor, as introduced in Chapter 1. Some RTL integration is then needed to interface it to the Cortex-A9 cluster over the ACP, or simply over the AXI ports available for bridging between the PS block and the PL block in both directions. We will cover this design methodology and its techniques later in this book when we cover the available co-processing methods in both Zynq SoC FPGA families.
This subsection lists the PS block processors and their features per FPGA SoC type.
The following table lists the Cortex-A CPU features for both the Zynq-7000 and Zynq UltraScale+ SoC FPGAs:
| Feature | Zynq-7000 SoC | Zynq UltraScale+ SoC |
| --- | --- | --- |
| Cortex-A cluster type | A9 ARMv7-A | A53 ARMv8-A |
| Cores per cluster | Single core or cluster of 2 cores | Cluster of 2 or 4 cores |
| ISA | AArch32, 16-bit and 32-bit Thumb instructions | AArch64, AArch32, and 32-bit Thumb instructions |
| Core performance | 2.5 DMIPS/MHz | 2.3 DMIPS/MHz |
| Operation modes | Both SMP and AMP | Both SMP and AMP |
| L1 caches | 32 KB L1 instruction cache, 32 KB L1 data cache | 32 KB L1 instruction cache, 32 KB L1 data cache |
| L1 instruction cache associativity | 4-way set-associative | 2-way set-associative |
| L1 data cache associativity | 4-way set-associative | 4-way set-associative |
| L2 cache | 512 KB shared | 1,024 KB shared |
| L2 cache associativity | 8-way set-associative | 16-way set-associative |
| SIMD and FPU | NEON | NEON |
| Accelerator coherency port | ACP | ACP |
| Core frequency per speed grade or device type | (-1): up to 667 MHz; (-2): up to 766 MHz or 800 MHz; (-3): up to 866 MHz or 1 GHz | (CG): up to 1.3 GHz; (EG): up to 1.5 GHz; (EV): up to 1.5 GHz |
| Security | TrustZone | TrustZone and PS SMMU |
| Interrupts | GIC v1 | GIC v2 |
| Debug and trace | CoreSight | CoreSight |
Table 6.1 – The SoC PS block’s processor features
The following table lists the memory controllers and the storage interfaces available in both the Zynq-7000 and Zynq UltraScale+ SoC FPGAs:
| Feature | Zynq-7000 SoC | Zynq UltraScale+ SoC |
| --- | --- | --- |
| DRAM controller ports | 4 | 6 |
| DRAM controller standards | DDR2/LPDDR2, DDR3/DDR3L | DDR3/DDR3L/LPDDR3, DDR4/LPDDR4 |
| DRAM controller maximum capacity | 1 GB | 8 GB (up to 16 GB for DDR4) |
| DRAM controller ranks | 1 | 1 and 2 |
| QSPI flash controller | I/O and linear modes | DMA, linear, and SPI modes |
| OCM controller | 256 KB SRAM and 128 KB BootROM | 256 KB SRAM |
| SRAM controller | 1 | 1 |
| NOR flash controller | 1 | N/A |
| NAND flash controller | 1 | 1 |
| SD/SDIO controller | 1 | 1 |
| SATA controller | N/A | 1 |
Table 6.2 – The SoC PS block’s memory and storage controllers
The following table lists the communication interfaces available in both the Zynq-7000 and Zynq UltraScale+ SoC FPGAs:
| Interface | Zynq-7000 SoC | Zynq UltraScale+ SoC |
| --- | --- | --- |
| USB | Device, host, and OTG | Device, host, and OTG |
| Ethernet | 2x (10/100/1,000 Mbps) | 4x (10/100/1,000 Mbps) |
| SPI | 2 | 2 |
| CAN | 2 | 2 |
| I2C | 2 | 2 |
| PCIe | N/A in the PS | Gen2 x1/x2/x4 within the PS |
| UART | 2 | 2 |
Table 6.3 – The SoC PS block’s communication interfaces
There are also many dedicated hardware functions built into the PS of both Zynq SoCs, as summarized in the following table:
| Feature | Zynq-7000 SoC | Zynq UltraScale+ SoC |
| --- | --- | --- |
| DMA | 8x channels | 8x channels (LPD), 8x channels (FPD) |
| ADC | 1x XADC | 1x SYSMON |
| GPU | N/A | ARM Mali-400 MP2 |
| PMU | N/A | MicroBlaze-based PMU |
| Display controller | N/A | 1x VESA DisplayPort v1.2a |
Table 6.4 – The SoC PS block’s dedicated hardware functions
In the SoC architecture exploration phase, we also need to know the following:
The speed grade classifies FPGA SoCs in terms of the maximum frequency at which the design elements in the PS and PL can run; a higher FPGA SoC speed grade carries a higher price tag. These details are too vast to summarize here, but they are all provided by Xilinx in the FPGA product selection guides. You are encouraged to read the Zynq-7000 SoC FPGAs Product Selection Guide at https://docs.xilinx.com/v/u/en-US/zynq-7000-product-selection-guide and the Zynq UltraScale+ SoC FPGAs Product Selection Guide at https://docs.xilinx.com/v/u/en-US/zynq-ultrascale-plus-product-selection-guide to learn more.
The key defining elements of the specific FPGA SoC to choose are the PS block's required processing power and the amount of FPGA logic elements necessary to build the custom hardware that implements the key acceleration functions or the company IPs. It is logical to start with the most cost-effective option and build on it by adding more features as more details are unveiled, moving on to the next target device when necessary. Since the hardware and software partitioning phase hasn't been performed yet, it is a good idea to carry out a technical assessment to estimate the main functions that must execute in hardware, either because they cannot run in software or because they are company or third-party IPs forming part of the overall SoC architecture. Any new company IP to be built into the hardware should also be listed, which will give us an idea of the direction in which the choice should head.
For this chapter, a practical example is the best approach to apply the ideas and suggestions listed thus far since we will be implementing this first simple but complete design in the next chapter. Therefore, we will start with the SoC architecture, which is based on a Zynq-7000 SoC since we would like this exercise to be simple and illustrative.
We need to perform the following tasks:
In Part 3 of this book, we will be building more complex SoCs that require higher processing power and will therefore potentially target the Zynq UltraScale+ SoC. We will also look at performing system profiling to help us implement custom acceleration hardware that we will integrate into the SoC, and we will build the necessary software drivers for these custom functions, which will run under an operating system such as Embedded Linux.
To conclude the architecture exploration phase, we must come up with a list of candidate options, each with its advantages and disadvantages. By doing this, we can compare them in terms of cost, design and verification effort, and time. We usually discuss these with the business stakeholders and decide on the best option. After this, we start mapping the functions of the SoC onto the processing elements (PEs), which are either hardware blocks or software functions, and start the next phase of the SoC architecture development.
As mentioned in the previous section, the architecture development task of mapping the functions of the SoC to the available PEs is best exercised using a practical example.
To perform hardware and software partitioning, let's design an SoC that implements the intelligent parts of a dummy financial Electronic Trading System (ETS). It's a dummy because it isn't a system we can use to perform financial transactions in an Electronic Trading Market (ETM) managed by a specific private organization; it just behaves like one. Most financial ETSs are co-located in a data center managed by a private organization. The interface between the ETM and the trading clients is a network switch where the trading clients plug in their network interfaces, which connect them to the ETM. The market itself is a network of servers that broadcasts the market data over, for example, the User Datagram Protocol (UDP) and receives trade transactions and their confirmations over, for example, the Transmission Control Protocol/Internet Protocol (TCP/IP). Both UDP and TCP are part of the Internet Protocol (IP) suite, commonly known as the TCP/IP stack, and are widely used in computer networking and communications architectures. Further details on the TCP/IP protocol suite can be found at https://datatracker.ietf.org/doc/html/rfc1180.
In the envisaged wider ETM network architecture, every client is connected to the trading market switch over Ethernet interfaces and listens to the market information carried in the UDP broadcast packets. The information is specific to the ETM itself and is formatted as the ETM sees fit; our system should be able to cope with whatever formatting is used and adapt to it if it's updated by the ETM. The ETM organization uses trading symbols, each representing a financial market product. Information about these symbols, such as the asking prices, the volumes, the transactions on a given symbol, and many other details, is broadcast by the ETM. This information is what the ETM has pre-formatted and what the clients listen to while electronic trading is open. The clients receive the ETM information over UDP via their Ethernet interfaces from the ETM switch, decode its content, filter it, make decisions in software (or in software accelerated by hardware), and then inform the market of any buying or selling decisions over their TCP/IP connection. As we can imagine, the client with the fastest round-trip communication and processing is the one who can maximize their gains and be the first to exploit a good opportunity, such as a low asking price for a symbol they are targeting and wish to invest in. This race to zero latency is at the heart of the low-latency, high-frequency ETMs that exploit the superior technological solutions a trading organization may possess to drive the market in one direction or another. The following diagram depicts the simplified electronic financial market concept:
Figure 6.1 – Electronic trading data center concept
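Since each ETM defines its own market-data format, the sketch below shows one purely hypothetical layout for a broadcast symbol update; none of these field names, widths, or orderings come from a real exchange feed, and they are only meant to make the kind of information carried in the UDP packets concrete:

```cpp
// Purely hypothetical market-data message layout, for illustration only.
// A real ETM feed defines its own binary format, field widths, and ordering.
#include <cstdint>

#pragma pack(push, 1)
struct MarketDataUpdate {
    uint32_t symbol_id;     // identifier of the traded instrument
    uint64_t timestamp_ns;  // exchange timestamp of the update
    uint32_t ask_price;     // best ask price, in fixed-point ticks
    uint32_t bid_price;     // best bid price, in fixed-point ticks
    uint32_t ask_volume;    // volume available at the ask
    uint32_t bid_volume;    // volume available at the bid
};
#pragma pack(pop)
// Each UDP broadcast datagram from the ETM would carry one or more such
// records; the SoC must parse them and adapt if the format is revised.
```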
For our SoC architecture development exercise, we need to design an SoC that can do the following:
The electronic trading SoC is part of a trading server:
The preceding list of capabilities is just the bare minimum to implement in the SoC to build the ETS. The objective is to design a system with the lowest latency possible and use all the possible techniques to make such a trading system as fast and secure as possible. Many other questions will arise as we design the SoC architecture, but these should concern details rather than fundamental architecture issues. We will start by putting all the listed capabilities and the options we can use to implement them in a table. Then, we will cover the overall implementation of every capability before seeing whether it is better suited to software, hardware, or a combination of both in our low-latency ETS. The following table classifies these capabilities and their possible implementation options:
| Capability | Hardware | Software | Both |
| --- | --- | --- | --- |
| [CP1] | √ | √ | √ |
| [CP2] | √ | √ | √ |
| [CP3] | √ | √ | √ |
| [CP4] | √ | √ | √ |
| [CP5] | √ | √ | √ |
| [CP6] | √ | √ | √ |
| [CP7] | √ | √ | √ |
| [CP8] | √ | | |
| [CP9] | √ | √ | √ |
| [CP10] | √ | √ | √ |
| [CP11] | √ | | |
| [CP12] | √ | | |
| [CP13] | √ | √ | √ |
| [CP14] | √ | √ | √ |
| [CP15] | √ | √ | √ |

Table 6.5 – Electronic trading SoC capabilities classification
As you can see, all the capabilities except database management can be implemented in hardware only, software only, or using both software and hardware. We have excluded database management as it is a background task that isn't worth designing from scratch in RTL. At this stage, we are also looking to assess the effort required to design a capability in hardware, since we assume it will be faster when executed on a hardware PE specifically designed for it. Most of the time, this is true. But the real question to ask is whether this speed-up is worth the effort, the time spent, and the implementation cost. To answer it, we need to draw a back-to-back data communication pipeline for the ETS, highlight the time-sensitive critical paths in this communication pipeline, and understand what it would mean to move a capability from software to a custom hardware PE designed for it. We assume that the easiest implementation (though not necessarily the optimal one in terms of speed) is putting everything in software. Once a capability is moved from software to hardware, we can understand what the interface between the two looks like. This is important and by itself requires further consideration.
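One simple way to frame whether such a speed-up is worth the effort is Amdahl's law: if a capability accounts for a fraction p of the end-to-end pipeline latency, and a hardware PE makes that capability s times faster, the overall improvement is bounded by

Overall speed-up = 1 / ((1 − p) + p / s)

For example, accelerating a stage that represents only 10% of the end-to-end latency by a factor of 100 improves the overall latency by barely 11%, which is rarely worth a custom PE; the same effort spent on a stage that dominates the critical path pays off far more.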
If we look at our ETS, the back-to-back communication pipeline that is sensitive to latency runs from the moment a UDP packet hits the Ethernet port of the SoC to the moment a TCP/IP packet with a trade decision is sent from the SoC back to the switch acting as the interface with the ETM, via the same or another Ethernet port. The fastest solution is to design everything in hardware and use IPs designed for very low latency, but that is the spirit of a pure high-frequency trading system and would defeat this book's objective, which is learning how to design SoCs and integrate PL-based IPs with them using both the PS and PL blocks of the FPGA. We would still like to design a low-latency solution, but we don't want to design an exotic TCP/IP stack and middleware, or any hardware-based real-time operating system (RTOS), in RTL. Therefore, we will use a simpler approach where we apply hardware acceleration when it makes sense to meet our design objectives. Our back-to-back data communication pipeline does the following:
The server side will not be included in our simple example design of this electronic trading SoC, but it may be a good addition to cover PCIe. This will be covered in the advanced applications of this book in Part 3.
The following diagram summarizes our electronic trading receive communication path:
Figure 6.2 – Electronic trading receive communication path
The preceding diagram illustrates the frontend of the electronic trading communication path, where the data of interest for trading can be easily highlighted, analyzed, and then mapped to either a hardware PE, a software PE, or a combination of both. The backend communication path is assumed to be well suited to software, so it will be implemented as such; this is why it isn't shown in the preceding diagram. It needs a TCP/IP stack to communicate trades to the ETM. Implementing the TCP/IP protocol in hardware is an enormous task, so it isn't worth the cost and effort, or even the effort of finding out whether it is available as a third-party IP to license for our project; the latency gains it may give us aren't important for this specific ETS. The software trading algorithm has three high-priority tasks that run on the Cortex-A9:
As mentioned previously, we are not focusing on the server side of the system in this architecture definition example.
Now, let’s analyze these paths and highlight the critical paths in this ETS to evaluate which PE type or combination will be used to implement them.
From Figure 6.2 and the previous descriptions, we can conclude that the critical paths for our low-latency trading SoC are as follows:
Therefore, to build a low-latency solution, all the PEs required to fulfill the tasks involved in these two critical paths of our low-latency electronic trading system need to be implemented in hardware; putting them in software isn't going to be fast enough. Now, let's revisit Table 6.5 and update it so that it suits our hardware and software partitioning exercise:
| Capability | Hardware | Software | Both |
| --- | --- | --- | --- |
| [CP1] | √ | | |
| [CP2] | √ | | |
| [CP3] | √ | | |
| [CP4] | √ | | |
| [CP5] | √ | | |
| [CP6] | √ | | |
| [CP7] | √ | | |
| [CP8] | √ | | |
| [CP9] | √ | | |
| [CP10] | √ | √ | √ |
| [CP11] | √ | | |
| [CP12] | √ | | |

Table 6.6 – ETS capabilities partitioned between the hardware and software PEs
Please note that we have removed [CP13], [CP14], and [CP15] from the table as we are not designing the Server PCIe integration side of the SoC in this initial architecture design example.
We still need to define how we will manage the split between hardware PEs and their peer software PEs, and what the interfacing between them should be. We also need to implement a way by which Ethernet packets that aren't of interest to our hardware acceleration paths are returned to be consumed directly by the software, as would be done if no hardware acceleration were implemented. We also need to study the consequences of our partitioning on the Ethernet controller software driver and make the necessary changes to adapt it to our new hardware design. All these important details will be covered in the following section.
The acceleration path introduced in the end-to-end communication path between the ETM switch and the software running on the Cortex-A9 is only necessary for the UDP packets received over the Ethernet port. Anything else received over this Ethernet port, such as Ethernet management frames and ARP frames, is of no interest for acceleration, so we would like the acceleration path to be transparent to such traffic. However, we can't achieve this by simply returning the received Ethernet packets that aren't of interest to our acceleration hardware to the Ethernet controller, because the receive buffer within the Ethernet controller is a FIFO to which Ethernet frames can't be written back once they have been consumed from it. One approach is to let the Cortex-A9 processor manage the Ethernet frame reception and pass the frames to the hardware acceleration PE, which behaves like a packet processor. This is the approach that introduces the least work for the implementation, specifically for the Ethernet controller software driver and for the mechanism by which the hardware acceleration PE is notified of the arrival of new frames. The hardware DMA interrupt is hard-wired to the Cortex-A9 GIC, so the easiest way to pass notifications is via the Cortex-A9, which acts as a proxy in this respect.
In the data reception model of the Ethernet controller, which uses its DMA engine in the Zynq-7000 SoC, the Cortex-A9 software sets up the DMA engine within the Ethernet controller to transfer the received Ethernet frames to a destination memory. Then, the DMA engine notifies the Cortex-A9 via an interrupt. We want this Ethernet frame transfer to be done to memory located within the PL, to the OCM, or to the DDR DRAM memory. When the Cortex-A9 receives the interrupt from the Ethernet DMA engine indicating that the received Ethernet frames have been transferred to the nominated memory, it rings a doorbell register within the hardware accelerator domain to notify it that a number of Ethernet frames have been received, and how many. We first want to filter these Ethernet frames, extract the UDP frames from them, and then put the other Ethernet frames back where the Cortex-A9 expects them. Only after the hardware acceleration engine has performed this filtering and stored the non-UDP packets back in the receive memory is a second notification sent to the Cortex-A9 via another interrupt. This subsequent interrupt tells the Cortex-A9 that Ethernet frames have been received that it needs to deal with; the first DMA receive interrupt was just for the Cortex-A9 to forward the frames to the hardware filtering and processing PE.

Breaking the received data path model built around the Ethernet DMA engine and its associated interrupt notification has another aspect we must deal with: the data exchange interface mechanism between the Ethernet DMA engine and the Cortex-A9 software. This exchange is done via the DMA descriptors that are prepared by software for the DMA engine to use when data is received by the Ethernet controller. As mentioned in Chapter 4, Connecting High-Speed Devices Using Busses and Interconnects, these descriptors specify the data's local destination and the pointer to the next descriptor, as well as an important field known as the Ownership field. The Ownership field tells the DMA engine that the DMA descriptor is valid, that the software has consumed it from its previous use, and that it is ready to be reused.
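Before turning to the descriptor handling, here is a minimal software-side sketch of the two-stage notification flow just described. The register address, handler names, and helper functions are hypothetical placeholders for illustration, not actual Zynq-7000 driver code:

```cpp
// Hypothetical sketch of the two-stage notification flow described above.
// Register names, the base address, and the helper functions are
// illustrative assumptions, not actual Zynq-7000 or accelerator definitions.
#include <cstdint>

// Memory-mapped doorbell register exposed by the PL packet-filtering PE
// (assumed to sit behind an AXI GP port; the address is a placeholder).
volatile uint32_t* const ACCEL_DOORBELL =
    reinterpret_cast<volatile uint32_t*>(0x43C00000);

extern uint32_t count_newly_received_frames();   // walks the RX DMA ring
extern void process_remaining_frames();          // normal Ethernet RX handling

// Stage 1: Ethernet DMA receive interrupt.
// Instead of processing the frames, the Cortex-A9 only forwards the
// notification to the hardware acceleration PE via the doorbell register.
void ethernet_rx_dma_isr()
{
    uint32_t frames = count_newly_received_frames();
    *ACCEL_DOORBELL = frames;   // tells the PL PE how many frames to inspect
}

// Stage 2: interrupt raised by the acceleration PE once it has filtered the
// set, consumed the UDP market-data packets, and put the rest back.
void accel_done_isr()
{
    process_remaining_frames(); // Cortex-A9 now handles the non-UDP frames
}
```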
We need to make sure that the task recycling the DMA descriptors after their associated received data has been consumed performs this correctly, including for the UDP packets that are now consumed by the hardware PE rather than by the Cortex-A9 software. This is manageable: we simply need to add a mechanism between the hardware PE and a DMA Descriptors Recycling Task (DDRT) in software, in the form of a DMA Descriptor Recycling Queue (DDRQ) through which the hardware accelerator engine sends recycling requests to the DDRT running on the Cortex-A9. Another issue we have introduced in this reception model over the Ethernet DMA engine is that the consumer of the Ethernet frames is no longer only the Cortex-A9, but both the hardware acceleration engine and the Cortex-A9. This model introduces out-of-order consumption of the Ethernet frames, but this isn't a problem: the out-of-order consumption would only matter if it introduced a discrepancy in the DMA descriptor reuse model, and the descriptor pool should have enough entries to absorb this reordering. That is, by the time a contiguous set of DMA descriptors is reused for subsequent receive operations, both the software and the hardware will have finished with their associated data; even if the hardware finishes first and flips the Ownership field of its descriptors, the software will have had time to catch up.
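As a rough illustration of the descriptor handling and the DDRQ just described, consider the following sketch; the field layout and names are assumptions made for illustration only (they do not match the actual Zynq-7000 GEM descriptor format), and the sizing figures are placeholders rather than measured values:

```cpp
// Illustrative sketch only: a simplified receive descriptor and the DDRQ
// entry used by the hardware PE to ask the software DDRT to recycle
// descriptors. Layout and names are assumptions, not the real GEM format.
#include <cstdint>

struct RxDescriptor {
    uint32_t buffer_addr;   // where the DMA engine wrote the received frame
    uint32_t next_desc;     // pointer to the next descriptor in the ring
    uint32_t length;        // number of bytes received
    uint32_t ownership;     // 1 = owned by the DMA engine (free), 0 = owned by a consumer
};

// One DDRQ entry: the hardware PE posts the index of a descriptor whose
// UDP payload it has fully consumed, so the DDRT can flip Ownership back.
struct DdrqEntry {
    uint16_t desc_index;
};

// Rough sizing of the descriptor pool, as discussed below: it must cover the
// worst-case frame arrival rate for as long as the slowest consumer may take.
constexpr double kMaxFrameRatePerSec = 1488095.0;  // ~64-byte frames at 1 Gbps
constexpr double kSlowestConsumerSec = 0.001;      // assumed worst-case latency
constexpr unsigned kDescriptorPoolSize =
    static_cast<unsigned>(kMaxFrameRatePerSec * kSlowestConsumerSec) + 1;
```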
If there is a dependency between the received Ethernet frames, this shouldn't be an issue either, as the ETM protocol guarantees that no change is introduced in the protocol until all the clients have acknowledged the reception of the update and have adjusted to it. This approach requires a DMA descriptor pool large enough that the slowest consumer can finish while the Ethernet controller keeps up with the rate at which Ethernet frames arrive; this pool size can easily be computed from the maximum rate at which we expect the Ethernet frames to arrive. The Ethernet interface's default receive path includes the controller hardware, the associated receive DMA engine, and the Ethernet software drivers. To minimize the changes in this receive path, which now includes the added UDP packet filtering, the hardware acceleration engine will deal with the filtering task on a per-set basis: it will send a job completion notification to the Cortex-A9 once it has consumed all the Ethernet frames received within the last set. When the hardware acceleration engine finds a UDP packet within the set of Ethernet frames received from the ETM, it does the following:
All non-treated Ethernet frames (those that are not UDP packets) are left for the Cortex-A9 to consume, which is handled as follows:
In this model, the flow of processing looks almost the same as it would without hardware acceleration of the UDP packet processing. We are just delaying the processing of the non-UDP packets by the Cortex-A9 until the hardware acceleration engine has had the chance to inspect the received Ethernet packets, deal with the ones found to be market data UDP packets, notify the market trading tasks about the UDP packets of interest, and request the DDRT to mark the DMA descriptors of these frames as consumed. Only then is a final notification sent to the Cortex-A9 so that it can deal with the remaining non-UDP packets, if any, recycle their corresponding DMA descriptors back to the pool, and call the Ethernet driver to finish the receive flow.
The changes in the hardware model require some associated changes in the Ethernet controller software model, but this can easily be done since the Ethernet receive path can function in a delayed mode. Therefore, the changes we are introducing by adding the filtering path can be made almost transparently from the Ethernet controller driver's perspective. What changes is what the Cortex-A9 does when it receives the Ethernet DMA receive notification: instead of calling the Ethernet frame processing directly, it passes the frames to the hardware PE and then waits for the hardware PE's notification before processing any remaining received Ethernet frames that weren't accelerated. The UDP packets found to be urgent and matching the filters are dealt with by the hardware PE. To summarize, the hardware-to-software interfacing and communication process goes as follows:
The following diagram illustrates these steps:
Figure 6.3 – ETS low-latency path hardware to software interaction
The Semi-Soft algorithm idea isn't new; it has been around for decades, ever since FPGA technology became prevalent in the electronics industry. It is a combination of hardware and software from the initial architecture development stages onward and is used to implement compute algorithms.
When targeting a Zynq SoC, the hardware and software split between the PEs is considered from the start of the project, rather than implementing all the compute algorithms in software, profiling them, pinpointing the bottlenecks (if any), and then hardware-accelerating them. There is nothing wrong with the latter approach; it is just limiting in what can be achieved when targeting an ASIC technology to implement the design, because once the overall SoC architecture and microarchitecture have been defined, it is hard to modify the design at a later stage and introduce the optimal communication paths between the hardware and the software. When the SoC targets an FPGA, as in our case, the simpler approach is to consider the acceleration from the architecture phase, as we have done, so that we can put in place packet-processing compute algorithms that are Semi-Soft. This means we can modify them easily in both the hardware and the software, so long as we have the appropriate interfacing between the two.
A Semi-Soft algorithm is a computational methodology that uses both the hardware and the software from the start of the architecture design phase. In our example, we are focusing on packet processing, which can greatly benefit from the Ethernet packets being inspected in parallel. Since all the UDP fields can be checked in parallel, many packets can also be checked simultaneously, and many decisions and results can be found and forwarded to the Cortex-A9 in parallel. Although the Cortex-A9 will have to deal with them sequentially when using a single-core CPU, it can process them in parallel if we deploy a dual-core CPU cluster, where each core is dedicated to a trade (buy or sell) queue.
Consequently, we continue maximizing parallelism and making faster trade decisions. The possibilities are enormous when we consider the compute as a mixture of generic software sequential compute resources of the PS and the parallel and multi-instance possible hardware acceleration engines of the PL. These could augment the PS capabilities to make a powerful custom hybrid processor. In our ETS, the Cortex-A9 manages the Ethernet DMA engine, recycles the DMA descriptors on behalf of both the hardware and software, and runs the middleware for the TCP/IP stack and other Ethernet link management. On the other hand, the hardware performs the Ethernet frame filtering and UDP packet processing, and then notifies the software to complete any remaining packet processing that does not need to be low latency or is complex to perform in hardware.
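To illustrate why this packet inspection parallelizes so well, the following is a small software reference model of the kind of filtering the PL engine would perform. The feed port, symbol encoding, and the assumption of untagged IPv4 frames with a 20-byte IP header are all illustrative, not part of any real ETM format; in the PL, every comparison below maps to its own comparator, and they can all be evaluated in the same clock cycle for many frames at once:

```cpp
// Software reference model of the filtering the PL engine performs; purely
// illustrative. Field offsets follow standard Ethernet/IPv4/UDP headers
// (untagged frames, 20-byte IPv4 header), but the market-data port and the
// symbol layout are hypothetical assumptions.
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr uint16_t kMarketDataUdpPort = 15000;  // assumed ETM feed port

// Returns true if the frame is an IPv4/UDP packet on the market-data port
// carrying one of the symbols we track.
bool is_interesting_market_data(const uint8_t* frame, size_t len,
                                const uint32_t* symbols, size_t num_symbols)
{
    if (len < 42) return false;                          // Eth(14)+IPv4(20)+UDP(8)
    const bool is_ipv4 = frame[12] == 0x08 && frame[13] == 0x00;
    const bool is_udp  = frame[23] == 17;                // IPv4 protocol field
    const uint16_t dst_port = (frame[36] << 8) | frame[37];
    if (!is_ipv4 || !is_udp || dst_port != kMarketDataUdpPort) return false;

    uint32_t symbol;                                     // assumed: first payload word
    std::memcpy(&symbol, frame + 42, sizeof(symbol));
    for (size_t i = 0; i < num_symbols; ++i)             // a parallel match in the PL
        if (symbols[i] == symbol) return true;
    return false;
}
```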
A design methodology that allows PE elements to be flexibly moved in and out of the PL, either to replace elements of the software tasks or to be moved back into software, can be compared with other, more complex, hardware electronic systems. One example is using a PCIe interconnect hosting a hardware accelerator such as a GPU, which can accelerate parallel compute operations for the main processor. We can also think of an architecture that uses discrete electronic components for the packet processing task: a network processor (NP) hosted over the PCIe interconnect of a server platform can perform the packet processing, but as can be imagined, the cost is at least an order of magnitude higher, the design framework is complex, and the technical skills required to put such a system together are far more demanding than for an FPGA SoC. Also, a server-based PCIe approach won't be low latency; it may be a good approach for high-volume traffic processing, but it isn't comparable for low latency, as this architecture adds more layers to the communication stack between the software and the hardware accelerators. Furthermore, such a platform can only efficiently perform packet processing; it doesn't help if we need to accelerate another aspect of the software tasks that system profiling finds to be inefficient. The FPGA SoC provides us with flexibility and a paradigm shift toward the architecture design phase via an early system implementation that we call the Semi-Soft algorithm concept.
Other solutions provide a whole framework for performing this mixing by defining hardware functions and integrating them into a kernel, such as the OpenCL approach. However, OpenCL tries to abstract the compute operations from the compute engines performing them and defines methods to match the two. You can find out more about OpenCL at https://www.khronos.org/opencl/.
Another methodology used in the past exploits the FPGA partial reconfiguration feature, where a block of the FPGA is defined as a reprogrammable function and used on demand by software running on the same FPGA, or running externally and interfaced to the FPGA. The design can define many computational functions that fit within this reprogrammable block; when a computation acceleration is required, the block is reconfigured to perform that specific hardware acceleration task. The reprogrammable block should have a predefined interface with the software through which data and commands are provided to the hardware function, and through which results and statuses are returned by the hardware accelerator hosted in it.
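As a sketch of what such a predefined software interface could look like (the register names, layout, and base address are hypothetical illustrations, not a Xilinx-defined API):

```cpp
// Hypothetical register window that a reprogrammable block could expose to
// software; names, layout, and the base address are illustrative assumptions.
#include <cstdint>

struct AcceleratorRegs {
    volatile uint32_t command;      // operation requested by software
    volatile uint32_t data_addr;    // physical address of the input buffer
    volatile uint32_t data_len;     // length of the input buffer in bytes
    volatile uint32_t result_addr;  // where the accelerator writes its results
    volatile uint32_t status;       // busy/done/error flags reported by hardware
};

// Whatever function is currently loaded into the reconfigurable block, the
// software always talks to it through this same register window.
auto* const accel = reinterpret_cast<AcceleratorRegs*>(0x43C10000); // placeholder
```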
More information on the FPGA partial reconfiguration can be found at https://docs.xilinx.com/v/u/2018.1-English/ug909-vivado-partial-reconfiguration.
When targeting an FPGA for an SoC implementation, reimplementing the design takes a matter of days, or sometimes even just hours, whether it involves changing the behavior of an RTL block or dropping in a verified IP. Doing the same when the target technology is an ASIC is a lengthy process with significant costs in terms of resources and budget.
We will introduce system modeling in the closing section of this chapter as it is part of architecture development in general, and it is also becoming a time-to-market solution for system implementations that take a long time to accomplish, specifically when targeting an ASIC technology. The industry exploits the availability of detailed system models of processors, interconnects, and all the IP elements of an SoC to build virtual SoCs: system models that emulate the functional behavior of an SoC in software simulation. Many frameworks are available for putting such system models together. These system models behave functionally like the target hardware SoC, and software can be prototyped on them before the SoC hardware and the electronics board are available. Timing accuracy is sometimes sacrificed to speed up the execution of the system model in simulation when it is used for early software development; in this case, we produce a version called a virtual prototype (VP), and the quantum or simulation time is sometimes fast-forwarded to the areas of the prototyped software execution that interest the developer.
The major frameworks of system modeling are as follows:
The final timing-accurate system model is what we call a golden model. It is usually built according to the SoC architecture specification and can be used as a reference during the system's final verification to benchmark its functional correctness and achieved performance.
Many IP vendors provide SystemC and TLM2.0 models of their IPs so that they can be used in system modeling to perform the early architecture exploration activities and build VPs for early software development. As mentioned previously, this captures the system architecture specification and helps make a golden model against which the functional verification of the system’s RTL can be compared.
SystemC and TLM2.0 have become de facto standards for building IP models and connecting them in a transaction-oriented way using TLM2.0 socket-based interfaces. SystemC and TLM2.0 are available for free from Accellera, which also provides a basic SystemC simulator. For more information on this framework, go to https://www.accellera.org/downloads/standards/systemc.
This framework was known as the Open SystemC Initiative (OSCI) and is now owned and maintained by Accellera. It is an Institute of Electrical and Electronics Engineers (IEEE) standard, IEEE 1666-2011. SystemC is a C++ class library that provides macros allowing you to build concurrent processes that can communicate with each other through function calls and returns. It is associated with transaction-level modeling (TLM), a socket-based programming methodology also based on a C++ library and macros, which provides many ways of mimicking the handshaking, data, and control information exchange protocols and helps in building SoC interconnects. While this framework is free and can be used to build initial system models, free system models of specific processors and interconnect IPs are generally not available, so the IP vendor needs to be contacted in case they have them available for licensing.
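To make the modeling style concrete, the following is a minimal, self-contained sketch of a SystemC module exposing a TLM2.0 target socket; it models a made-up doorbell peripheral rather than any real Zynq IP, and the 10 ns access latency is an arbitrary illustrative figure:

```cpp
// Minimal SystemC/TLM2.0 sketch: a memory-mapped "doorbell" peripheral model
// exposing a TLM2.0 target socket. Illustration of the modeling style only.
#include <systemc>
#include <tlm>
#include <tlm_utils/simple_target_socket.h>

struct DoorbellModel : sc_core::sc_module {
    tlm_utils::simple_target_socket<DoorbellModel> socket;
    unsigned frames_pending = 0;

    SC_CTOR(DoorbellModel) : socket("socket") {
        // All reads/writes arriving on the socket land in b_transport().
        socket.register_b_transport(this, &DoorbellModel::b_transport);
    }

    void b_transport(tlm::tlm_generic_payload& trans, sc_core::sc_time& delay) {
        if (trans.get_command() == tlm::TLM_WRITE_COMMAND) {
            // The initiator (a CPU model) rings the doorbell with a frame count.
            frames_pending += *reinterpret_cast<unsigned*>(trans.get_data_ptr());
        } else if (trans.get_command() == tlm::TLM_READ_COMMAND) {
            *reinterpret_cast<unsigned*>(trans.get_data_ptr()) = frames_pending;
        }
        delay += sc_core::sc_time(10, sc_core::SC_NS); // modeled access latency
        trans.set_response_status(tlm::TLM_OK_RESPONSE);
    }
};
```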
Accellera (or OSCI) SystemC/TLM2.0 is great for building system IP models ahead of their RTL implementation, so it provides a rapid prototyping methodology for custom IPs. It is also very useful for building system-level prototypes where no specific CPU ISA is required, as it has a SystemC/TLM2.0 model of a generic RISC CPU. This can be programmed in assembly language for the sequential operations, but it only allows accurate system modeling that isn't CPU-centric, which means it isn't very useful for SoC VPs on its own. However, it can be combined with other frameworks, as most VPs have a bridge to SystemC/TLM2.0, which means it can still be part of the tools a system architect uses to build a final system model of the targeted SoC with freely available tools.
Platform Architect is a graphical user interface (GUI)-based system modeling environment from Synopsys. It is based on the SystemC/TLM2.0 framework but adds many utilities that make system modeling and architecture exploration easier than on the plain OSCI framework. Synopsys has built system models for its IPs and integrated many third-party IPs within Platform Architect, which can be dragged and dropped in the GUI to stitch a system model together. This comes with a significant price tag, and some third-party IP models – specifically, the CPUs and interconnects – need an additional license from their providers before they can be used in any system modeling work. For major SoC projects targeting ASICs, Synopsys Platform Architect could be a good solution, but for FPGA-based SoCs, other alternatives for system modeling should be considered.
Further information on the Platform Architect framework is available at https://www.synopsys.com/verification/virtual-prototyping/platform-architect.html.
gem5 is used in computer architecture research and is a framework for performing multi-processor system simulation by assembling computer system elements. The framework is centered on CPU models that are written in C++ and configured using a Python script. This framework is suitable for SoC system modeling, as it also provides a method by which SoC component models such as processor cores, interconnects, memory interfaces, and peripherals can be connected using the Python configuration script to form a custom SoC emulating the real SoC hardware. Custom IPs can be written to extend gem5's portfolio and allow the user to produce a system model for the target SoC architecture. gem5 is free and provided under a BSD-like license. A full system can be built, including the hardware, operating systems, and application software, to be run on the gem5 simulator. It supports two operating modes:
For the SoC architecture development tasks, FS mode is suitable for system modeling, whereas SE mode is better suited for early software prototyping and software-centered research. The CPU models aren't an exact translation of their RTL implementation; instead, gem5 uses an execution model with support for the specific Instruction Set Architecture (ISA), which is a good enough execution environment for architecture exploration work. Using gem5 requires a good understanding of the CPU microarchitecture, including its internal memory hierarchy and the settings used to customize the CPU in the Python script; this brings the model closer to the targeted CPU architecture to be used within the SoC. SystemC/TLM2.0 models are supported by gem5 either by bridging from gem5 to SystemC/TLM2.0 using a full bridge model available in gem5, or by running the SystemC/TLM2.0 models within gem5's built-in SystemC/TLM2.0 simulation kernel. The latter method is near native in terms of integration and configuration, since the SystemC/TLM2.0 models can be integrated and configured into the SoC system model using Python, just like gem5's native IP models. More details on the gem5 simulator can be found at https://www.gem5.org/.
QEMU is a machine emulator with multiple operating modes:
System emulation mode is the one relevant to SoC system modeling as it emulates the full set of SoC elements. It can boot guest operating systems and emulate many ISAs, including ARMv7, ARMv8, and MicroBlaze. QEMU is a free and open source framework licensed under GPL-2.0.
For more information on QEMU, go to https://www.qemu.org/.
Xilinx provides a QEMU port as a VP for SoCs built using the MicroBlaze processor, as well as for both the Zynq-7000 and Zynq UltraScale+ SoC FPGAs. The platform can connect to custom IPs written in SystemC/TLM via an interface provided by the Xilinx QEMU.
For more information on Xilinx QEMU, check out its User Guide at https://docs.xilinx.com/v/u/2020.1-English/ug1169-xilinx-qemu.
This chapter opened Part 2 of this book, which has a practical aspect to it, since we will be putting the theoretical topics introduced in Part 1 to use. This chapter was purely architectural: we need to understand why certain choices we implement in an SoC design are the way they are, and we need to be capable of making changes to the design microarchitecture while considering the overall system we are designing and whether we have met the stated objectives.

This chapter covered all the major steps involved in SoC architecture design. We started with the exploration phase, where the possible design options are studied and compared in terms of cost, implementation effort, and time. We proposed a comparative method by which the initial theoretical analysis can be conducted and described how the thinking process of choosing a potential solution can be driven. Then, we moved on to the next stage of the architecture definition, which was very analytical and was conducted practically on an example SoC, the ETS, which implements a dummy low-latency trading engine that behaves very much like a real one. We performed the hardware and software partitioning tasks on this trading engine by decomposing the SoC microarchitecture into many elements classified by tasks. While targeting the possible processing elements, we also looked at the end-to-end data path and what would make it low latency and easy to implement, before putting together a table containing the choices we made. After that, we figured out how to make these processing elements communicate with each other by considering their specific characteristics, and we listed these interfaces and how they should be dimensioned when such quantification is needed and makes sense.

This collaboration between the software and hardware processing elements naturally led us to the Semi-Soft algorithm concept and its importance for FPGA-based SoC designs, along with existing frameworks that follow a similar spirit, from OpenCL to FPGA partial reconfiguration, as well as the simple hardware acceleration method introduced in this chapter. We concluded by introducing the last stage of the architecture definition, system modeling, covering how helpful it is nowadays for complex designs that specifically target ASIC technologies, and the major frameworks currently used in the industry to perform system modeling and virtual prototyping.
In the next chapter, we will continue in the same vein by taking the ETS from its architecture definition to its implementation on the Zynq-7000 SoC FPGA. By doing so, you will learn how to translate the architecture choices into an implementable SoC for the ETS.
Answer the following questions to test your knowledge of this chapter: