dipl/sections/soc/soc.tex

\documentclass[../../Diplomschrift.tex]{subfiles}

\begin{document}
\section{SoC Peripherals}

The complete FPGA design consists not only of the CPU core, but a number of components that allow it to operate as well as to communicate with the outside environment. They are connected to the core using a shared 32-bit bus.

\subsection{UART}

The easiest way to communicate with an embedded system is usually through a serial interface. To ensure the best compatibility with existing software, a National Semiconductor 16550 UART was reimplemented from scratch instead of creating a new design. Thus, the modules's functionality and design can be found in the 16550's datasheet~\cite{pc16550}.

\subsection{DVI graphics}

As can be seen in \autoref{fig:video-core-diagram}, the graphics module consists of several subcomponents:

\begin{itemize}
 \item The VGA timing generator creates the impulses and counters necessary to drive a VGA-, DVI- or HDMI-based display
 \item The text renderer draws text characters onto the screen using a built-in font ROM
 \item The TMDS encoder frontend converts the internal parallel signals into a set of high-speed serial streams necessary for DVI or HDMI.
\end{itemize}

\begin{figure}[h]
\includegraphics[width=\textwidth]{graphics.png}
\caption{Block diagram of the video core}
\label{fig:video-core-diagram}
\end{figure}

\subsubsection{VGA timing}

The timing of VGA signals dates back to analog monitors. Even though this original purpose is only very rarely used nowadays, the timing remained the same for analog and digital DVI all the way to modern HDMI.

In analog screens, the electron beams (one for each primary colour red, green and blue) scan across the screen a single horizontal line at a time while being modulated by the colour values, resulting in a continuous mixture of all three components. When a beam reaches the end of a scanline, it continues outside the visible area for a small distance (the ``Front Porch''), is then sent to the beginning of the next line by a pulse of the hsync (Horizontal Sync) signal, and draws the next line after another short off-screen period (the ``Back Porch'').

The same applies to vertical timings: after the beam reaches the end of the last line, a few off-screen Front Porch lines follow, then a pulse of the vsync (Vertical Sync) signal sends the beam to the top of the screen, where the first line of the next frame is drawn after several invisible Back Porch lines.

\begin{figure}[h]
\includegraphics[width=\textwidth]{vga_timing.png}
\caption{Diagram of VGA timing intervals}
\label{fig:vga-timing}
\end{figure}

The VGA timing module generates these hsync and vsync signals as visualized in \autoref{fig:vga-timing}, along with a blanking signal (active during any front porch, sync and back porch) and, while in the visible area (i.e. not blanking), the row and column of the current pixel relative to the visible area.

\subsubsection{Text renderer}

The text renderer converts a logical representation of a character, such as its ASCII code (henceforth referred to as its \textit{codepoint}) to a visual representation (a \textit{glyph}). This conversion is achieved using a \textit{font}, a mapping of codepoints to glyphs.

\begin{figure}[h]
\includegraphics[width=0.7\textwidth]{text_renderer.png}
\caption{Block diagram of the text renderer}
\label{fig:text-renderer-diagram}
\end{figure}

As can be seen in \autoref{fig:text-renderer-diagram}, the current pixel coordinate (created by the VGA timing generator) is split up into two parts: the character index, which specifies the on-screen character the pixel belongs to, and the offset of the pixel within this character. The character index is passed to the text RAM, which contains the codepoint for each on-screen character. This codepoint, along with the pixel offset, is looked up in the font ROM to determine the colour of the pixel.

\subsubsection{TMDS encoder}

DVI and HDMI are serial digital transmission standards. Three data lines (corresponding to red, green, and blue channels) along with a clock line transmit all colour information as well as synchronization signals. The encoding used for these signals is Transition-Minimized Differential Signaling (TMDS). It is a kind of 8b/10b encoding (transforming every 8-bit chunk of data into a 10-bit chunk) that is designed to minimize the number of changes of the output signal.

\subsection{Ethernet}

The Arty development board contains an RJ-45 Ethernet jack connected to an Ethernet PHY. The PHY handles the physical connection to an copper twisted pair Ethernet network (Layer 1 of the OSI model) and exposes a standardized media-independent interface (MII) to the FPGA. The LiteEth core~\cite{liteeth}, which is released under a Free Software license, is used to integrate the Ethernet interface into the SoC.

\subsection{WS2812 driver}

A hardware driver for WS2812 serially-addressable RGB LEDs is also included in the SoC. It was developed independently as part of the curriculum at HTL and later incorporated into the SoC.

\begin{figure}[h]
\includegraphics[width=\textwidth]{ws2812.png}
\caption{Block diagram of the WS2812 driver}
\end{figure}

\begin{figure}[h]
\centering\includegraphics[width=0.5\textwidth]{ws2812_timing.png}
\caption{Timing diagram for the WS2812 serial protocol}
\label{fig:ws2812_timing}
\end{figure}

The driver is designed to be attached to external circuitry that provides colour data for any given LED index (address). This can either be discrete logic that generates the colour value from the address directly, or a memory that stores a separate colour value for each address.

The LEDs are controlled using a simple one-wire serial protocol. After a reset (long period of logic 0), the data for all LEDs is transmitted serially in one single blob. Each LED consumes and stores the first 24 bits of the stream and applies them as its colour value (8 bits each for red, green, blue), all following bits are passed through unmodified. The second LED thus uses the first 24 bits of the stream it receives, but since the first LED already dropped its data, these are actually the second set of 24 bits of the source data.

Every bit is encoded as a period of logic 1, followed by a period of logic 0. The timing of these sections determines the value, see \autoref{fig:ws2812_timing}.

The exact timing differs between models, so all periods can be customized using generics in the VHDL entity.

\subsection{DRAM}

The Arty A7 development board contains a 256MB DDR3 memory module. Since the FPGA only contains about 1.8MB of block RAM, some of which is already reserved for various hardware functions (e.g. the text buffer and WS2812 driver), the external memory is absolutely necessary to run larger programs.

Interfacing with DDR3 memory is notoriously difficult, requiring complex logic on both physical and logical layers. For this reason, the Free Software LiteDRAM core~\cite{litedram} is used to integrate the entire memory interface into the SoC. While irrelevant to the SoC, it can still be considered a slight peculiarity that the LiteDRAM core actually contains an entire separate RISC-V core to coordinate initialization of the memory.

\subsection{External Bus}

Bridging the internal SoC bus with the external peripheral bus requires a few steps. For one, the external data bus is bidirectional, so tri-state outputs must be used on the FPGA. In addition, the internal bus arbitrates components using addresses alone, while the external bus uses chip enable signals and overlapping address spaces. Lastly, the bus must be slowed down. While the internal bus runs at a frequency of 50 MHz, a reasonable frequency for the external circuitry is around 1 MHz. To achieve this, a clock divider is used to only change the state of the external bus interface every 64th clock cycle, resulting in an effective bus speed of under 1 MHz.

Due to a mistake in the adapter board layout, the nibbles of the address and data buses are reversed (MSB to LSB are pins 7 to 0 on the FPGA, but 3 to 0 followed by 7 to 4 on the board). Thanks to the completely arbitrary mapping of FPGA pins, this can be mitigated without using any additional resources.

\section{Software}

\subsection{Bootloader}

The CPU loads its machine code from an FPGA-internal block RAM. The initial value for this RAM is part of the bitstream, and if any changes to it are required, the entire project has to be resynthesized. Because this takes upwards of 5 minutes, a different solution was created: a fixed bootloader is encoded into the block RAM, which is able to read additional program code (the payload) from the UART at runtime and store it to available memory. After the transfer is complete, it simply jumps to the base address of the payload and continues execution from there. When the current payload exits or a hardware reset is actuated, a new program can be loaded instantly.

Because many subroutines are used in both the loader and the payload, duplicating them in the payload would be a waste of space. Using custom linker scripts and compiler flags, the payload is linked against the functions in the loader. Whenever a loader function is called from the payload, execution jumps to bootloader code, executes the requested actions and then returns to the payload.

\subsection{Drivers}

Several components required writing functions to make them easier to use. Some are as simple as writing a value to a specific memory location:

\begin{lstlisting}[
    language=c,
    label={lst:yarm-set-rgb-led},
    caption={Function to set the colour of an RGB LED on the Arty board}]
void set_rgb_led(size_t num, uint32_t color) {
	((volatile uint32_t*)ADDRESS_RGB_LEDS)[num] = color;
}
\end{lstlisting}

Others, like the function to write a character to the screen are more complicated and use further subroutines:

\begin{lstlisting}[
    language=c,
    label={lst:yarm-vga-putchar},
    caption={Function to write a character to the screen}]
void vga_putchar(screen_t *s, unsigned char c) {
	switch(c) {
		case '\n':
			set_cursor_pos(s, s->row + 1, 0);
			break;

		case '\b':
		// DEL
		case 0x7F:
			if (s->col > 0) {
				set_cursor_pos(s, s->row, s->col - 1);
			}
			if (c == 0x7F) {
				set_curr_char(s, ' ');
			}
			break;

		default:
			set_curr_char(s, c);
			set_cursor_pos(s, s->row, s->col + 1);
	}
}
\end{lstlisting}

\section{Testing}

\subsection{RISC-V Compliance Tests}

The RISC-V Compliance Test Suite~\cite{riscv-compliance} can be used to empirically confirm the correct functionality of a RISC-V processor. It consists of a series of programs that perform some operations related to a specific feature, then write some result data to a memory region. This memory region is then compared to a ``golden signature'', which was produced by a processor implementation that is known to be correct.

The initial implementation of the compliance tests uncovered several bugs in the processor core:

\begin{itemize}
 \item The bitshift instructions (SLL, SRL, SRA, etc.) must, according to the RISC-V standard, only use the lower 5 bits of the second operand as a shift offset. The implementation used all 31 bits instead, causing a test failure.
 \item Reading a signed value of a size less than 32 bits from memory would not perform proper sign extension. For example, reading a byte value of 0xFF (-1) would result in an expanded machine word of 0x0000_00FF (255) instead of 0xFFFF_FFFF.
 \item The \icode{SLTIU} (Set less than immediate; unsigned) instruction compares a given register with a constant provided as part of the instruction (the immediate). While the comparison is unsigned, the 12-bit immediate must be sign-extended as if it were a signed integer. The implementation wrongly assumed that the sign-extension should be unsigned as well.
 \item The Instruction Set Manual specifies exceptions that must be raised when a misaligned memory access occurs. These exceptions were not yet implemented, but since the compliance tests check for them, the functionality was added to make the tests pass.
\end{itemize}

Since these tests are easily automated, they were added to the GitLab Continuous Integration (CI)~\cite{gitlab-ci} configuration. Whenever a new Git commit is pushed to GitLab, the tests are run automatically, and any failures are reported to the responsible committer via email.

\subsection{Formal Verification}

While carefully selected simulation is useful to uncover bugs and to ensure they can't happen again (regression testing), it never offers complete certainty - it is simply impossible to manually cover all possibilities of inputs. With formal verification, the circuit under test is expressed using a mathematical model and an algorithm (a \emph{SAT solver}) ensures that certain manually-selected criteria are always fulfilled. A detailed explanation of the algorithm can be found in~\cite{symbiyosys-slides}.

As an example of formal verification, the core's ALU (\autoref{sec:core-alu}) has been extended with a formal verification definition, which can be seen in \autoref{lst:alu-formal}. Skipping over some helper logic in the beginning, the first statements add assumptions about the entity's input signals. These are rules that must be obeyed by designs using the component, otherwise the correct function cannot be guaranteed (and is indeed unproven).

Below these assumptions, a process is used to calculate the expected result whenever a calculation is requested. While most operations are implemented the same as in the main entity and thus have little value as a known-good comparison value, the bit-shift operations are implemented incrementally in the main ALU and directly in the verification; thus, the less resource-intensive ALU implementation can be confirmed to function exactly like this more expensive method.

Afterwards, the first assertions actually happen: these are the theorems that the formal verification suite will prove to be correct. The only two proven statements that are actually relevant to users are that when a result has been computed, it will equal the value computed using the aformentioned process, and that a computation will always finish eventually.

Finally, two more assertions are used to give hints to the formal verification algorithm, specifically the induction step. It is sometimes very difficult or even impossible to arrive at a successful induction; these assertions can be proven trivially, eliminating a number of potential scenarios that would otherwise make a successful complete proof impossible.

\lstinputlisting[
    style=vhdlstyle,
    label={lst:alu-formal},
    caption={Formal verification block for the ALU},
]{alu_formal.vhd}
\end{document}