dipl/sections/soc/soc.tex

169 lines
13 KiB
TeX

\documentclass[../../Diplomschrift.tex]{subfiles}
\begin{document}
\part{Meta}
\section{History}
The project started out with the desire to build a CPU from scratch. Examples such as The NAND Game\cite{nandgame} and Ben Eater's Breadboard Computer series\cite{breadboard_computer} served as inspirations and guidance during development.
At first, a design similar to Ben Eater's consisting solely of discrete integrated circuits was considered, but soon discarded in favor of an FPGA-based design. Designing the logic alone was a difficult task, implementing it in discrete hardware would have pushed the project far over the allotted maximum development time.
RISC-V was chosen as the instruction set architecture for the processor. Its modular design with a very small base instruction set make it easy to implement a basic processor that is still fully compatible with existing software and toolchains.
As a starting point, a Terasic DE0 development board\footnote{\url{https://www.terasic.com.tw/cgi-bin/page/archive.pl?No=364}} containing an Altera Cyclone III\footnote{\url{https://www.intel.com/content/www/us/en/products/programmable/fpga/cyclone-iii.html}} FPGA was borrowed from the school's inventory. It was used to implement a first version of the core.
The only method of synthesis for Altera devices is to use the proprietary Quartus IDE. However, the last version of Quartus to support the Cyclone III series of FPGAs (version 13.1) had already been out of date for several years at the start of the project. Because of this and the increasing resource demand of the developing core, an Arty A7-35T development board\footnote{\url{https://store.digilentinc.com/arty-a7-artix-7-fpga-development-board-for-makers-and-hobbyists/}} with a Xilinx Artix-7\footnote{\url{https://www.xilinx.com/products/silicon-devices/fpga/artix-7.html}} FPGA was ordered from Digilent.
The two FPGAs compare as follows:
\begin{tabular}{l r r}
& Altera EP3C16 & Xilinx XC7A35T \\
Logic Elements & 15000 & 33280 \\
Multipliers & 56 & 90 \\
Block RAM (kb) & 504 & 1800 \\
PLLs & 4 & 5 \\
Global clocks & 20 & 32 \\
\end{tabular}
The periphery on the development boards:
\begin{tabular}{l|r|r}
& Terasic DE0 & Digilent Arty A7-35T \\
Switches & 10 & 4 \\
Buttons & 3 & 4 \\
LEDs & 10 + 4x 7-segment & 4 + 3 RGB \\
GPIOs & 2x 36 & 4x PMOD + chipKIT \\
Memory & 8MB SDRAM & 256MB DDR3L \\
Others & SD card, VGA & Ethernet \\
\end{tabular}
While the Digilent board offers fewer IO options, the DDR3 memory can be interfaced using Free memory cores and allows for much larger programs to be loaded, possibly even a full operating system. The missing VGA port has been substituted by a HDMI-compatible DVI interface that is accessible through one of the high-speed PMOD connectors.
\section{Tooling}
FPGA design is done using a Hardware Description Language (HDL). The two most well-known HDLs are Verilog and VHDL (VHSIC (Very high speed integrated circuit) HDL). As part of our studies at HTL, we exclusively worked with VHDL. For this reason, and because VHDL offers a better type system, it was chosen as the language of choice for the project.
\subsection{Vendor Tools}
The conventional way to work with FPGA designs is to use the FPGA vendor's development solution for simulation, synthesis and place-and-route. All of these tools are proprietary software specialized to a certain FPGA manufacturer, so a change of hardware also requires changing to a completely different software solution.
Vendor tools are usually free-of-charge for basic usage, but this also means there is no guaranteed support. During the development of this project, several bugs and missing features were found in vendor tools that required workarounds.
\subsection{Free Software Tools}
A somewhat recent development is the creation of Free Software\footnotemark{} FPGA toolchains. A breakthrough was achieved by Claire (formerly Clifford) Wolf in 2013 with yosys\cite{yosys-paper, yosys}, a feature-complete Verilog synthesis suite for Lattice's \texttt{iCE40} FPGA series.
\footnotetext{``Free Software'' refers to software that grants its user the freedom to share, study and modify it - see \url{https://www.fsf.org/about/what-is-free-software}.}
Since then, both yosys and place-and-route tools like nextpnr\cite{nextpnr} have matured, however Lattice's iCE40 and ECP5 remained the only supported FPGA architectures for place-and-route.
Thus, two obstacles remained for Free toolchains to be viable for this project: synthesizing \emph{from} VHDL code and synthesizing \emph{to} Artix-7 FPGAs. During the development of the project, both of these were solved: Tristan Gingold released ghdlsynth-beta\parencite*{ghdlsynth-beta}, a bridge between GHDL\cite{ghdl} and yosys allowing VHDL to be synthesized just the same as Verilog, and Dave Shah added Xilinx support to nextpnr\cite{nextpnr-xilinx}. The latter was preceded by many months of volunteer work reverse-engineering the Xilinx bitstream format as part of \textit{Project X-Ray}\parencite*{prjxray}.
With these two pieces in place, the project was switched over to a completely Free toolchain, removing any depencies on vendor tools:
\begin{itemize}
\item yosys, with ghdl as a frontend for processing VHDL, is used to synthesize the design
\item nextpnr-xilinx, together with the Project X-Ray database, is used for place-and-route
\item tools from Project X-Ray are used to convert the routed design to a bitstream
\item openFPGALoader is used to transfer the bitstream to the FPGA via JTAG
\end{itemize}
\section{Peripherals}
\subsection{UART}
% TODO
\subsection{DVI graphics}
The graphics submodule consists of a VGA timing generator, a text renderer with a font ROM, and a DVI encoder frontend:
\begin{figure}[h]
\includegraphics[width=\textwidth]{graphics.png}
\caption{Block diagram of the video core}
\end{figure}
\subsubsection{VGA timing}
The timing of VGA signals dates back to analog monitors. Even though this original purpose is only very rarely used nowadays, the timing remained the same for analog and digital DVI all the way to modern HDMI.
In analog screens, the electron beams (one for each primary color red, green and blue) scan across the screen a single horizontal line at a time while being modulated by the color values, resulting in a continuous mixture of all three components. When a beam reaches the end of a scanline, it continues outside the visible area for a small distance (the ``Front Porch''), is then sent to the beginning of the next line by a pulse of the hsync (Horizontal Sync) signal, and draws the next line after another short off-screen period (the ``Back Porch'').
The same applies to vertical timings: after the beam reaches the end of the last line, a few off-screen Front Porch lines follow, then a pulse of the vsync (Vertical Sync) signal sends the beam to the top of the screen, where the first line of the next frame is drawn after several invisible Back Porch lines.
\begin{figure}[h]
\includegraphics[width=\textwidth]{vga_timing.png}
\caption{Diagram of VGA timing intervals}
\end{figure}
The VGA timing module generates these hsync and vsync signals, along with a blanking signal (active during any front porch, sync and back porch) and, while in the visible area (i.e. not blanking), the row and column of the current pixel relative to the visible area.
\subsubsection{Text renderer}
The text renderer converts a logical representation of a character, such as its ASCII code (henceforth referred to as its \textit{codepoint}) to a visual representation (a \textit{glyph}). This conversion is achieved using a \textit{font}, a mapping of codepoints to glyphs.
\begin{figure}[h]
\includegraphics[width=0.7\textwidth]{text_renderer.png}
\caption{Block diagram of the text renderer}
\end{figure}
First, the current pixel coordinate (created by the VGA timing generator) is split up into two parts: the character index, which specifies the on-screen character the pixel belongs to, and the offset of the pixel in this character. The character index is passed to the text RAM, which contains the codepoint for each on-screen character. This codepoint, along with the pixel offset, is looked up in the font ROM to determine the color of the pixel.
\subsubsection{TMDS encoder}
DVI and HDMI are serial digital transmission standards. Three data lines (corresponding to red, green, and blue channels) along with a clock line transmit all color information as well as synchronization signals. The encoding used for these signals is Transition-minimized differential signaling (TMDS). It is a kind of 8b/10b encoding (transforming every 8-bit chunk of data into a 10-bit chunk) that is designed to minimize the number of changes of the output signal.
\subsection{Ethernet}
The Arty development board contains an RJ-45 Ethernet jack connected to an Ethernet PHY, which exposes a standardized media-independent interface (MII) to the FPGA. The LiteEth core\cite{liteeth}, which is released under a Free Software license, is used to integrate the Ethernet interface into the SoC.
\subsection{WS2812 driver}
A hardware driver for WS2812 serially-addressable RGB LEDs is also included in the SoC. It was developed independently as part of the curriculum at HTL and later incorporated into the SoC.
\begin{figure}[h]
\includegraphics[width=\textwidth]{ws2812.png}
\caption{Block diagram of the WS2812 driver}
\end{figure}
\begin{figure}[h]
\centering\includegraphics[width=0.5\textwidth]{ws2812_timing.png}
\caption{Timing diagram for the WS2812 serial protocol}
\label{fig:ws2812_timing}
\end{figure}
The driver is designed to be attached to external circuitry that provides color data for any given LED index (address). This can either be discrete logic that generates the color value from the address directly, or a memory that stores a separate color value for each address.
The LEDs are controlled using a simple one-wire serial protocol. After a reset (long period of logic 0), the data for all LEDs is transmitted serially in one single blob. Each LED consumes and stores the first 24 bits of the stream and applies them as its color value (8 bits each for red, green, blue), all following bits are passed through unmodified. The second LED thus uses the first 24 bits of the stream it receives, but since the first LED already dropped its data, these are actually the second set of 24 bits of the source data.
Every bit is encoded as a period of logic 1, followed by a period of logic 0. The timing of these sections determines the value, see \ref{fig:ws2812_timing}.
The exact timing differs between models, so all periods can be customized using generics in the VHDL entity.
\subsection{DRAM}
% TODO
\subsection{External Bus}
Bridging the internal SoC bus with the external peripheral bus requires a few steps. For one, the external data bus is bidirectional, so tri-state outputs must be used on the FPGA. In addition, the internal bus arbitrates components using addresses alone, while the external bus uses chip enable signals and overlapping address spaces.
Due to a mistake in the adapter board layout, the nibbles of the address and data buses are reversed (MSB to LSB are pins 7 to 0 on the FPGA, but 3 to 0 followed by 7 to 4 on the board). Thanks to the completely arbitrary mapping of FPGA pins, this can be mitigated without using any additional resources.
\section{Testing}
\subsection{RISC-V Compliance Tests}
The RISC-V Compliance Test Suite\cite{riscv-compliance} can be used to empirically confirm the correct functionality of a RISC-V processor. It consists of a series of programs that perform some operations related to a specific feature, then write some result data to a memory region. This memory region is then compared to a ``golden signature'', which was produced by a processor implementation that is known to be correct.
The initial implementation of the compliance tests uncovered several bugs in the processor core:
\begin{itemize}
\item The bitshift instructions (SLL, SRL, SRA, etc.) must, according to the RISC-V standard, only use the lower 5 bits of the second operand as a shift offset. The implementation used all 31 bits instead, causing a test failure.
\item Reading a signed value of a size less than 32 bits from memory would not perform proper sign extension. For example, reading a byte value of 0xFF (-1) would result in an expanded machine word of 0x0000_00FF (255) instead of 0xFFFF_FFFF.
\item The \icode{SLTIU} (Set less than immediate; unsigned) instruction compares a given register with a constant provided as part of the instruction (the immediate). While the comparison is unsigned, the 12-bit immediate must be sign-extended as if it were a signed integer. The implementation wrongly assumed that the sign-extension should be unsigned as well.
\item The Instruction Set Manual specifies exceptions that must be raised when a misaligned memory access occurs. These exceptions were not yet implemented, but since the compliance tests check for them, the functionality was added to make the tests pass.
\end{itemize}
Since these tests are easily automated, they were added to the GitLab Continuous Integration (CI) configuration. Whenever a new git commit is pushed to GitLab, the tests are run automatically, and any failures are reported to the responsible committer via email.
\end{document}