Preface
Recently I have been writing some code for
ASCIIPlay
and was contemplating multithreading implementations for some of its code. This got me thinking
about high core count programming more generally. For many modern use cases, high core count programming
is done using paradigms such as CUDA and OpenCL. These are robust and well documented
tools that sit at the core of many projects; however, their underlying architecture gives them a steep learning curve.
This reminded me of an old Intel project that offered coprocessors you could program with
normal x86 coding paradigms: the Xeon Phi.
The Xeon Phi series has a strange story, starting with Intel's first attempts at a discrete video card,
code named
Larrabee.
Intel designed Larrabee as a hybrid between traditional GPGPUs and CPUs, which would have allowed
the cards to offer features far ahead of their time, such as real time ray tracing.
Eventually, however, Larrabee was canceled and its architecture was recycled to create the first
generation Xeon Phi (what I will be using here). This first generation, codenamed Knights Corner,
was built as a coprocessor card that occupied a PCIe slot in another system; later generations
abandoned this structure in favor of being high core count CPUs in their own right. Because this
coprocessor architecture only existed for a single generation, documentation on it is relatively sparse. In addition,
support for the Phi has been removed from Intel's oneAPI, which replaced Parallel Studio (the suite used
to write and compile code for the Phi).
Given these caveats, I will have to find workarounds for the obstacles standing between me and
this card. This blog will be where I document the steps I take towards getting
code running on the Phi, so that others who want to play around with high core counts can do the same.
Getting the Phi
I managed to purchase a Xeon Phi 71s1p on eBay for ~$100 CAD. Unfortunately, I had believed this to be a 16GB
model, but it was in fact only 8GB. Had I known this, I could have picked up a 5120p, which goes for closer to $60 CAD on eBay; however,
the 71s1p does have 4 more threads than the 5120p as well as higher core clocks, so this wasn't a total waste.
On receiving the card a few things became apparent. First, this is a server card, meaning the mounting
brackets it came with can't be used in a normal computer. These were easy enough to remove by unscrewing a
handful of Phillips head screws, which left me with a card that had no mounting hardware.
I would have to make my own eventually, but for now I could install it in a system to test.
Or so you would think, but in fact, no, the card still cannot be used in a normal system as is. The
Xeon Phi cards were sold in a variety of SKUs varying in memory capacity, core count, and, importantly
for a normal user, cooling format. The p at the end of the SKU denotes a passively
cooled card. This does not mean the card can cool itself with no airflow; instead, it
relies on the high airflow in a server chassis to force air through its heatsink in order to stay
within its operating temperature. Since I will be mounting it in a system that doesn't meet
Intel's recommended airflow requirements, another solution is needed.
Thankfully, after a few hours running my 3D printer, I was able to print a
fan shroud to which I could mount an 80mm fan to keep the Phi running cool. This isn't
a normal 80mm fan, however: the fan I chose is rated for 100 CFM at full speed, with plenty of
static pressure to force air through the Phi's restrictive heatsink. Other cooling options exist,
with some people even speculating about watercooling, but this is a cheap, effective cooler that
is fit for purpose. Finally, I designed and printed a
replacement bracket
so that I could properly mount the card to my case.
Installation
In order to run the Phi until installation was complete, I connected the cooling fan directly to 12V
from the system's PSU. On first boot I had to enter the system BIOS and enable above 4G decoding
in order to allocate sufficient PCIe BAR space for the Phi. Following this, I could proceed with
OS installation. Since the Phi is a coprocessor by design, a host system is required to provide
things like power, networking, and storage. In order to have an operating system interface correctly with
the card, some additional software is needed from Intel. Officially, Intel supported Windows, Red Hat Enterprise Linux, and SUSE Linux
Enterprise Server; however, there is documentation online from others who have managed to recompile the
Intel packages to run on Ubuntu and some other systems.
My first attempt at running the Phi was with Ubuntu Server 20.04; however, this was
quickly cut short because the code needed for the Phi has been removed from newer systems. There followed
a period of trying various OSes, and I eventually settled on Windows as
the only OS on which installation is well documented. Perhaps one day I will attempt to
install a Linux OS to run this card, but for now Windows isn't a hindrance, as most of my work will
not be done on the host machine.
Installation on Windows is fairly straightforward: simply
download the MPSS Windows release of your choice from Intel (so long as Intel keeps these pages online)
and install it following the included documentation. Once the utility is properly installed, you can
interface with the card through the Windows command prompt (administrator privileges are required for some features).
Using this access, one of the first things I did was set up a script to poll the Phi for its die temperature
and write it to a file.
By then writing a simple batch script to launch this through NodeJS at system boot, I get a
fairly frequently updated temperature report from the card. On its own, however, this script
isn't particularly useful, so I coupled it with
Fan Control, an open source fan curve customization utility. I created a simple curve that aims to
keep the card inaudible at idle while keeping die temps below roughly 85C. This keeps
it within Intel's specifications and should prevent it from ever thermal throttling.
Once the card could be used without going deaf, the next course of action was to enable access to it,
which was handled by again following the documentation and copying an SSH public key for the root user
to the filesystem. Since the card runs from a RAM filesystem that is reset on every boot,
persistent files are actually stored on the host system. This is a very useful feature, as we can
copy files into this local mount of the card's filesystem to keep them accessible across boots. In addition,
the Phi also has the ability to mount NFS and CIFS shares, in theory.
The reality of mounting network shares on the Phi is not as cut and dried as one might hope. Since it
runs its own OS, the Phi can't simply gain network access through the host OS and its programs; instead
it relies on bridging the host network adapter with the virtual network adapter it presents to
the host. In theory this should be a simple process, after which the Phi can route
its traffic through the host system's network interface. In practice I was unable
to get this to work, so I will be leaving it for another day. As it stands, the card can
be controlled, I can log in to it, and I can run commands.
Getting A Compiler
With the Phi set up, the next thing to do was figure out how to compile my own code to run on it. This
turned out to be a relatively lengthy process: although the Phi is based on the x86 architecture,
it is different enough that it cannot run binaries compiled for typical x86 systems. As such, I would
need an alternative way to compile custom code. Intel used to provide the ability to
cross compile normal C code for the Phi's architecture using a special compiler flag. Unfortunately, they have removed
this feature from their more recent compilers and have also removed the ability to obtain
the old compiler that supported it.
Following some research online, I was brought to
this GitHub repo, which provides the source code for a version of GCC 5.1.1 that is capable of
compiling code for the Phi. However, in order to compile and use it, we will also need to prepare
a system with the necessary MPSS files, essentially recreating my earlier experiments in getting the Phi
to run on Linux. Here I will document the exact process I followed to get
compilation working.
-
Firstly, you need to install Ubuntu 14.10 on another system or within a VM. This is a fairly
old version of Ubuntu that is no longer officially supported, but it also has the best documentation
on setting up MPSS. I will try to find a better alternative for this at a later date.
-
Since Ubuntu 14.10 is no longer supported by Canonical, apt does not work out of the box,
so the apt endpoints need to be updated.
This can be done with a single sed command (included in the command sketch after this list),
or by manually changing every instance
of archive.ubuntu.com and security.ubuntu.com to old-releases.ubuntu.com in the /etc/apt/sources.list file.
Once apt is properly configured, I updated and upgraded as normal, then installed the
following packages: linux-headers-generic, build-essential, git, alien, flex, libz-dev.
These are required to get the compiler installed and working.
-
With the system installed and running, the next step is to get mpss-modules installed. Start
by cloning the mpss-modules git repo, then cd into the cloned repo and run make and make install
as normal, after which you will need to run depmod to update the module dependency list
(see the command sketch after this list).
-
After installing mpss-modules, the next step is to get MPSS itself from Intel. I chose
MPSS 3.4.10 since it is close to the version targeted by the repo, but it may be possible to
use the final release without too much extra work. The RPMs from the Intel packages then
need to be converted to the DEB format used by Ubuntu; we use alien for this task, after
which we can install the converted packages.
-
Now it is time to finally get and build GCC. First clone the repo, then set the required
environment variables and run the autoconfiguration script, after which we simply make and
install as normal. You may also wish to add the toolchain's bin directory to your PATH
in ~/.bashrc so you can call the compiler by name.
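For reference, here is a rough, consolidated sketch of the shell commands behind the steps above. The
angle-bracketed values are placeholders for the repositories and downloads linked in the steps, and the
GCC configure options come from that repo's own instructions, so treat this as an outline rather than a
copy-paste recipe.
```bash
# Point apt at the old-releases archive, then update and install build dependencies
sudo sed -i 's/archive.ubuntu.com\|security.ubuntu.com/old-releases.ubuntu.com/g' /etc/apt/sources.list
sudo apt-get update && sudo apt-get upgrade -y
sudo apt-get install -y linux-headers-generic build-essential git alien flex libz-dev

# Build and install the mpss-modules kernel modules
git clone <mpss-modules repo URL from the link above>
cd mpss-modules
make
sudo make install
sudo depmod -a
cd ..

# Convert the Intel MPSS 3.4.10 RPMs to DEBs and install them
# (download and extract the MPSS 3.4.10 archive from Intel first)
cd mpss-3.4.10
sudo alien -d *.rpm
sudo dpkg -i *.deb
cd ..

# Build and install the k1om-targeted GCC 5.1.1
git clone <GCC 5.1.1 repo URL from the link above>
cd <cloned GCC directory>
# set the environment variables described in the repo's instructions, then:
./configure <options from the repo's instructions>
make
sudo make install

# Optional: make the cross compiler callable by name
echo 'export PATH=$PATH:<k1om toolchain bin directory>' >> ~/.bashrc
```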
Once this configuration is done, it should be possible to compile code by simply calling the
compiler using k1om-mpss-linux-gcc instead of gcc.
Compiling Code
Having installed a functional compiler, which was a lot more work than I made it seem, I was finally
ready to write and compile my own programs. Obviously the first thing to try is a standard
hello world program. This turned out to be simple enough, since k1om-mpss-linux-gcc includes
the Phi-compiled standard libraries automatically. All I had to do was write a normal C
hello world and swap the compiler for k1om-mpss-linux-gcc. I chose to give my executables
a .mic extension to make them easier to identify, but this is not required.
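As a minimal example (the file names here are just my own convention), the whole exercise looks
something like this:
```bash
cat > helloworld.c << 'EOF'
#include <stdio.h>

int main(void)
{
    printf("Hello World!\n");
    return 0;
}
EOF

# use the cross compiler instead of the host gcc
k1om-mpss-linux-gcc helloworld.c -o helloworld.mic
```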
Upon running this code after copying it to the Phi using sftp, Hello World! is printed to the terminal.
A great success! Having had such excellent luck with this compilation, my next goal was to compile and
run htop. If you haven't used Linux much, htop is a program that provides a graphical view of
system resource utilization in the terminal. With 61 cores and 244 threads on this card,
running htop would definitely produce an interesting result, so I began by looking into the
htop source code. At first glance I thought it would be simple enough to compile for my architecture;
of course, this was wrong. Unlike my previous compilation, which called gcc directly, htop
is built using autoconf, which means I would need to specify some
flags to make the build use my cross compilation tools.
The first order of business was not htop itself, however, since htop has a dependency that must
be met before it can run on the Phi.
ncurses is a library that allows for better terminal interface programming, letting a developer
update the terminal buffer without constantly rewriting the whole thing, as well as supporting
character coloring. So, in order to run htop on the Phi, I would need to cross compile ncurses. Unfortunately,
ncurses is notoriously awkward to cross compile, since part of its build process involves compiling a
helper program that is used by the rest of the build. Since this helper needs to run on
the compiling system, it has to be compiled separately with the host compiler, and the build must be told
to use that copy rather than compile it again.
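My build looked roughly like the following. The exact flags are a sketch rather than a guaranteed
recipe, but the important part is --with-build-cc, which builds ncurses' helper tools with the host
gcc while everything destined for the Phi is built with the k1om compiler:
```bash
# adjust this prefix if your username or install location differs
export PREFIX=/home/phi/k1om/usr

# run from inside the extracted ncurses source tree
./configure --host=k1om-mpss-linux --prefix=$PREFIX --with-build-cc=gcc CC=k1om-mpss-linux-gcc
make
make install
```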
This compiles ncurses for the Xeon Phi and installs all of the library files to
/home/phi/k1om/usr. If your username is different or you want this to go somewhere else, just edit
the prefix export to match. You can then tar the generated usr directory and
copy it over to the Phi in order to use the shared libraries. However, I will be statically
linking my build of htop, meaning this step can mostly be skipped. Unfortunately, I did have to
copy the usr/share folder into the Phi's root filesystem to get the terminfo database,
as I was unable to get hashing working; this is something I will look at later. Next
comes the compilation of htop for the card, now that ncurses is built correctly.
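The htop build itself looked roughly like this. I am assuming a static build against the ncurses
prefix from above, and --disable-unicode because my ncurses was not built with wide-character
support, so adjust the paths and flags to your own setup:
```bash
# run from inside the htop source tree
export PREFIX=/home/phi/k1om/usr
./autogen.sh    # only needed when building from a git checkout
./configure --host=k1om-mpss-linux --enable-static --disable-unicode \
    CPPFLAGS="-I$PREFIX/include -I$PREFIX/include/ncurses" \
    LDFLAGS="-L$PREFIX/lib"
make
```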
The resulting htop executable, located within the htop source directory, can now be copied over
to the Phi and executed locally. First, however, the terminfo location needs to be set on the
card by running the following command after copying over the usr/share/ directory built by ncurses.
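On the card this amounts to pointing ncurses at the copied terminfo database; the path below
assumes the usr/share directory was copied to /usr/share on the Phi:
```bash
# run on the Phi itself, e.g. over ssh
export TERMINFO=/usr/share/terminfo
```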
Running htop after this will bring up a view showing only the average CPU usage, due to the high
core count; however, with a bit of configuration through the F2 menu, you can show each thread
individually, as shown here. Here are links to my compiled versions of
ncurses
and
htop.
Expanding Language Availability
With so many successful compilations under my belt, I feel like I am really starting to get the
hang of cross compiling. My next order of business is to attempt to expand the options I have
for programming the Phi. The toolchain built earlier on is great for compiling C/C++, but doesn't
allow for more modern languages. As such I decided that it was time to compile another language
to be compatible with the Phi.
Since I'm not insane, I decided to start small; rather than attempt to get a modern compiled language like Go
or Rust working, I have chosen to go with an interpreted language. There are many benefits to this
approach, not least of which is portability. Since interpreted
languages run from an intermediate form that is evaluated at run time, the only part that needs to be ported
to the card is the interpreter itself. Libraries and similar tools can be manually cross
compiled as well, but as long as the core language works, this shouldn't be too bad.
For my attempt, I have chosen to start with Python. Python is a popular scripting language that
is simple to understand and has easy to grasp syntax. My hope is that by making Python available
on this platform, it will open up the ability to work with these cards for average users. Mind
you, the current version I have is almost certainly unstable, but is a good stepping stone in
getting these cards out of the scrap heap and into the hands of people looking to learn.
I had foolishly thought at first that this would be a simple process, but was quickly proven wrong.
The first thing I did was compile my chosen version of Python for my local system to verify my
methodology. As soon as I did this, I realized I had a problem. Python has maintained many
iterations of its interpreter over the years, and through those years has implemented...
changes. Some of these changes rely on features of more modern versions of GCC, meaning that until
someone updates this build chain to a newer GCC, we are forced to either avoid those
features or use old sources. In my case, I have chosen the latter, at least as a
proof of concept. My chosen version of Python is therefore 3.6.1; Python 3.6.x is
still fairly well supported, and should be able to run just about anything we could want right now.
Having settled on a version to compile, next came the methodology test... again.
Thankfully, compilation and installation went fairly smoothly using the host system build chain,
meaning all that was left was to compile using the Phi build chain, right? Of course,
nothing is ever that simple, and I immediately ran into issues. Cross compiling Python,
as it turns out, requires a lot more configure flags than I had expected, so
I had a lot of online sleuthing to do. With all my flags configured, I was finally
able to start compilation. That is, until I ran into make errors related to the -lgcov flag.
From what I could find online, libgcov is the library the GCC toolchain uses to collect code
coverage data from compiled programs. In my case it was plainly missing from my build chain;
when I attempted to locate it on my system, it appeared it had never actually been installed into the
k1om build area. I also noticed that the only copy on my system was a static .a
library. Thinking I was missing a dynamic version, I attempted a static compilation of Python,
which only caused more headaches. In the end, after sufficient digging, I found out that
libgcov is only ever built as a static archive, which is linked even when everything else is linked dynamically.
Having unraveled the mystery of libgcov, the first order of business was to copy the compiled
libgcov from my original compilation of k1om-mpss-linux-gcc, located within the install directory,
and place it in the k1om build chain directories. With this done, the -lgcov flag was able to
successfully link to my version of libgcov and allow my build to succeed. I did also notice at this
time that there were compilation issues regarding ncurses, so I added the linking used for the htop
build and prayed for the best.
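Pulling this together, the build ended up looking roughly like the sketch below. Treat it as an
outline rather than the exact commands I ran: the libgcov paths are placeholders for wherever your
toolchain landed, and the configure arguments are the usual Python cross-compile suspects (the
ac_cv_file overrides are needed because configure cannot probe the target's /dev from the build
machine, and the natively built 3.6.1 from earlier serves as the build interpreter):
```bash
# libgcov was never installed into the k1om toolchain, so copy the static
# archive over from the original k1om-mpss-linux-gcc build (placeholder paths)
cp <k1om gcc install dir>/libgcov.a <k1om toolchain library directory>/

# cross compile Python 3.6.1 from inside its source tree
export PREFIX=/home/phi/k1om/usr
./configure --host=k1om-mpss-linux --build=x86_64-linux-gnu \
    --prefix=$PREFIX/Python3.6 \
    --disable-ipv6 \
    ac_cv_file__dev_ptmx=no ac_cv_file__dev_ptc=no \
    CC=k1om-mpss-linux-gcc \
    CPPFLAGS="-I$PREFIX/include -I$PREFIX/include/ncurses" \
    LDFLAGS="-L$PREFIX/lib"
make
make install
```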
The result of running these commands is an _install directory located at /home/phi/k1om/usr/Python3.6,
which can then be tar'd and copied over to the mic the same way as all the other compiled
code. Finally, the binary can be executed after installation, and a Python shell should pop up.
I have verified that a hello world and simple math operations work, but I have not validated
much beyond that, as the test suite is not run when cross compiling. Here is my compiled version of
Python.
New Compiler Accessibility
My previously configured compiler setup was functional, but not ideal. It relies on users configuring
their own system in order to compile, and the steps may potentially break in the future. As such
I have built a
Docker image containing a functional version of the compiler. Simply pull the image and you
should be able to compile using the toolchain located within it. I will go into more depth on how
I did this for users that are interested in recreating this or who are looking to accomplish
something similar.
The general process for preparing this Docker image was fairly similar to the one used to create
the initial compiler toolchain on Ubuntu 14.10. This time, however, it is based on Ubuntu 16.04, with
hopes of eventually updating to 20.04 for a more modern Docker build. The host system on which the Docker
image was generated was also based on 16.04, for ease of developing against the specific kernel required.
Speaking of the kernel, the most recent version of mpss-modules I was able to find that worked well
was
this one patched for kernel 4.10.11.
My first step was to create a VM running Ubuntu 16.04, as this version is both supported by Docker
and built on a kernel close to the version we are looking to use. Notably, however, it is not the
exact version we need, so the kernel has to be changed manually. This is done
by pulling the kernel packages from Canonical, installing them by hand, and rebooting
the system.
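A hedged sketch of that process, assuming the 4.10.11 builds are still available from Canonical's
mainline kernel archive (the exact .deb filenames include a build date, so copy them from the
directory listing):
```bash
# https://kernel.ubuntu.com/~kernel-ppa/mainline/v4.10.11/
wget <linux-headers-4.10.11-..._all.deb URL>
wget <linux-headers-4.10.11-...-generic_..._amd64.deb URL>
wget <linux-image-4.10.11-...-generic_..._amd64.deb URL>

# install the packages together and reboot into the new kernel
sudo dpkg -i linux-*.deb
sudo reboot
```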
Following the reboot, running uname -a will report that the system is running kernel 4.10.11, allowing
us to proceed to the next stage. In my case I installed the compiler toolchain locally first to
verify that it works, and then built my Docker container afterwards. I will only be listing the
steps for getting a working Dockerfile, as a local install on 16.04 uses the same commands.
Firstly, the container needs to be updated and build tools need to be installed. Notably, however,
we will not be installing the generic linux headers package, since we have already downloaded the specific headers we need.
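In Dockerfile terms these become RUN lines; the underlying commands are roughly:
```bash
apt-get update && apt-get upgrade -y
apt-get install -y build-essential git alien flex libz-dev wget
```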
Following this, the Docker image needs its own copy of the linux headers installed.
I chose to download them again inside the image, but it should be fine to copy in the ones downloaded earlier.
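Again the filenames are placeholders for the 4.10.11 header packages; a COPY of the previously
downloaded .debs would work just as well:
```bash
wget <linux-headers-4.10.11-..._all.deb URL>
wget <linux-headers-4.10.11-...-generic_..._amd64.deb URL>
dpkg -i linux-headers-*.deb
```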
Now the steps become very similar to before: we need to install mpss-modules from GitHub, though notably this time
the Makefile requires a MIC_CARD_ARCH flag to be set.
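Something along these lines, assuming MIC_CARD_ARCH=k1om is the right value for a Knights Corner card:
```bash
git clone <patched mpss-modules repo URL from the link above>
cd mpss-modules
make MIC_CARD_ARCH=k1om
make install MIC_CARD_ARCH=k1om
```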
Since the version of mpss-modules we pulled is more up to date, we can now use the final release
of MPSS from Intel, 3.8.6, which we download and install using alien to convert the RPM
packages to DEB as before.
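As before, the RPMs are converted with alien and installed; the MPSS 3.8.6 archive itself comes from
Intel's download pages:
```bash
# after downloading and extracting the mpss-3.8.6 archive from Intel
cd mpss-3.8.6
alien -d *.rpm
dpkg -i *.deb
```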
Finally, k1om-mpss-linux-gcc needs to be compiled and bashrc updated. Since MPSS
3.8.6 is being used, a few path locations need to change in the configuration.
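The build itself mirrors the earlier one; the placeholders below stand in for the repo's own
configure options and install paths, with any MPSS-related paths pointed at the 3.8.6 locations:
```bash
git clone <GCC 5.1.1 repo URL from the link above>
cd <cloned GCC directory>
# set the environment variables from the repo's instructions (updated for mpss-3.8.6), then:
./configure <options from the repo's instructions>
make
make install

# make the toolchain callable by name inside the container
echo 'export PATH=$PATH:<k1om toolchain bin directory>' >> ~/.bashrc
```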
After this, the container can be connected to and all the build tools should be accessible, as with
the previous configuration. If you prefer, you can download the
Docker file.