A blog of my experiences working with a Xeon Phi
Xeon Phi Blog
Recently I have been writing some code for ASCIIPlay and was contemplating multithreaded implementations for some of it. This got me thinking about high core count programming. For many modern use cases, high core count programming is mostly done using platforms such as CUDA and OpenCL. These are robust, well documented tools that sit at the core of many projects; however, they have a steep learning curve due to their underlying architecture. This reminded me of an old project by Intel to provide coprocessors that let you use normal x86 coding paradigms: the Xeon Phi.
The Xeon Phi series has a strange story, starting with Intel's first attempts at a discrete video card, code named Larrabee. Intel designed Larrabee as a hybrid between traditional GPUs and CPUs, which could have allowed features far ahead of their time, such as real time ray tracing. Eventually, however, Larrabee was canceled and its architecture was recycled to create the first generation Xeon Phi (which is what I will be using here). This first generation, codenamed Knights Corner, was built as a coprocessor card that occupies a PCIe slot in a host system, with later generations moving away from this structure in favor of standalone high core count CPUs. Because this architecture only really existed for a single generation, documentation on the processor is relatively sparse. In addition, support for the Phi has been removed from Intel's oneAPI, which replaced Parallel Studio (the suite used to write and compile code for the Phi).
Given these caveats, I will have to find a number of solutions to work around the blocks in the way of using this card. This blog will be a location for me to document the steps I take towards getting code running on this Phi so that others can do the same if they want to play around with high core counts.
Getting the Phi
I managed to purchase a Xeon Phi 71s1p on eBay for ~$100 CAD. Unfortunately, I had believed this to be a 16GB model, but it was in fact only 8GB. Had I known this, I could have picked up a 5120p on eBay for closer to $60 CAD; however, the 71s1p does have 4 more threads than the 5120p as well as higher core clocks, so this wasn't a total waste. On receiving the card, a few things stood out. First, this is a server card, meaning that the mounting brackets it came with can't be used in a normal computer. These were easy enough to remove by unscrewing a handful of Phillips head screws. This left me with a card that had no mounting hardware, meaning that I would have to make my own, but for now I could install it in a system to test.
Or so you would think, but in fact, no: the card still cannot be used in a normal system as is. The Xeon Phi cards were sold in a variety of SKUs varying in memory capacity, core count, and, importantly for a normal user, cooling format. The p at the end of the SKU denotes that this is a passively cooled card. This does not mean that the card is able to cool itself with no airflow; instead, it relies on the high airflow that exists in a server to force air through its heatsink in order to stay within its operating temperature. Since I will be mounting this in a system that doesn't meet Intel's recommended airflow requirements, another solution was needed.
Thankfully, after a few hours running my 3D printer, I was able to print a fan shroud to which I could mount an 80mm fan in order to keep the Phi running cool. This isn't a normal 80mm fan, however: the fan I chose is rated for 100 CFM at full load, with plenty of static pressure to force air through the restrictive heatsink of the Phi. Other cooling options exist, with some people even speculating about water cooling, but this is a cheap and effective cooler that is fit for purpose. Finally, I designed and printed a replacement bracket so that I could properly mount the card to my case.
In order to run the Phi until installation was complete, I connected the cooling fan directly to 12 V from the system's PSU. On first boot I had to enter the system BIOS and enable Above 4G Decoding in order to allocate sufficient PCIe BAR space for the Phi. Following this, I could proceed with OS installation. Since the Phi is a coprocessor by design, a host system is required to provide things like power, networking, and storage. In order for an operating system to interface correctly with the card, some additional software is needed from Intel. Officially, Intel supported Windows, Red Hat Enterprise Linux, and SUSE Linux Enterprise Server; however, there is documentation online from others who have managed to recompile the Intel packages to run on Ubuntu and some other systems.
My first attempt at running the Phi in a system was through Ubuntu Server 20.04; however, this was quickly cut short because support for the Phi has been removed in newer systems. What followed was a period of trying various OSes, after which I eventually settled on Windows, since it is the only OS on which installation is well documented. Perhaps one day I will attempt to install a Linux OS to run this card, but for now Windows isn't a hindrance, as most of my work will not be done on the host machine.
Installation is fairly straightforward in Windows: simply download the MPSS Windows release of your choice from Intel (so long as Intel keeps these pages online) and install it following the included documentation. Once the utility is properly installed, you can interface with the card through the Windows command prompt (administrator privileges are required for some features). Using this access, one of the first things I did was set up a script to poll the Phi for its die temperature and write it to a file.
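A minimal sketch of such a polling step, written as a shell function rather than NodeJS, and assuming a POSIX shell with grep -o is available on the host (for example Git Bash or WSL on Windows). The command that reports the temperature is passed in as arguments because it depends on the MPSS release; the echo below only stands in for it, and the output format it prints is an assumption.

```shell
#!/bin/sh
# write_die_temp OUTFILE CMD [ARGS...]
# Runs CMD, pulls the first number out of its output, and writes it to OUTFILE.
# CMD is a placeholder for whatever MPSS tool reports the die temperature;
# its name and output format are assumptions, not taken from Intel's docs.
write_die_temp() {
    outfile=$1
    shift
    # Extract the first integer or decimal number from the command output.
    temp=$("$@" | grep -oE '[0-9]+(\.[0-9]+)?' | head -n 1)
    if [ -z "$temp" ]; then
        echo "no temperature found in output" >&2
        return 1
    fi
    printf '%s\n' "$temp" > "$outfile"
}

# Example with a stand-in command (replace echo with the real query).
write_die_temp /tmp/phi_temp.txt echo "Die Temp: 54 C"
cat /tmp/phi_temp.txt
```

Calling this in a loop with a sleep between iterations gives a file that a fan control tool can read as a sensor value.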
By then writing a simple batch script to launch this through NodeJS at system boot, I am able to have a fairly frequently updated temperature report from the card. On its own, however, this script isn't particularly useful, so I coupled it with Fan Control, an open source fan curve customization utility. I created a simple curve that aims to keep the card inaudible at idle while keeping die temperatures below around 85 °C. This keeps it within Intel's specifications and should keep it from ever thermal throttling.
Once the card could be used without going deaf, the next course of action was to enable access to it, which was handled by again following the documentation and copying an SSH public key for the root user to the filesystem. Since the card uses RAM-FS and the file system is reset every time it is booted, the persistent files are actually stored on the host system. This is a very useful feature, as we can copy files to this local mount of the file system to keep them accessible across boots. In addition to this, the Phi also has the ability to mount NFS and CIFS shares, in theory.
The reality of mounting network shares on the Phi is not as cut and dried as one may hope. Since it runs its own OS, the Phi can't simply gain network access through the host OS and its programs; instead, it relies on bridging the host network adapter with the virtual network adapter it presents to the host. Theoretically this should be a simple process, after which the Phi will be able to route its traffic through the host system's network interface. In practice, however, I was unable to get this to work, and as such will be leaving it for another day. As it stands now, the card can be controlled, I am able to log in to it, and I can run commands.
Getting A Compiler
With the Phi set up the next thing to do was to figure out how to compile my own code to run on it. This turned out to be a relatively lengthy process, since although the Phi is based on the x86 architecture, it is different enough to not be able to run binaries compiled for typical x86 systems. As such I would need to find an alternative solution to compiling custom code. Intel used to provide the ability to cross compile normal C code to the Phi's architecture using a special flag. Unfortunately, they have removed this feature from their more recent compilers, and have also completely removed the ability to get the old compiler with this feature as well.
Following some research online, I was brought to this GitHub repo, which provides the source code for a version of GCC 5.1.1 that is capable of compiling code for the Phi. However, in order to compile and use this, we will also need to prepare a system with the necessary MPSS files, essentially recreating my earlier experiments in getting the Phi to run in Linux. Here I will document the exact process I followed in order to get compilation working.
  • Firstly, you need to install Ubuntu 14.10 on another system or within a VM. This is a fairly old version of Ubuntu that is no longer officially supported, but it also has the best documentation on setting up MPSS. I will try to find a better alternative for this at a later date.
  • Since Ubuntu 14.10 is so old, it is no longer officially supported by Canonical. As such, apt does not work out of the box, meaning that we need to update the apt endpoints. This can be done using the following sed command, or by manually changing every instance of archive.ubuntu.com and security.ubuntu.com to old-releases.ubuntu.com in the /etc/apt/sources.list file. Once apt was properly configured, I proceeded to update and upgrade as normal, after which I installed the following packages: linux-headers-generic, build-essential, git, alien, flex, libz-dev. These are required to get the compiler installed and working.
  • With the system installed and running the next step is to get mpss-modules installed. Start by cloning the mpss-modules git repo. Then cd into the cloned repo and run make and make install as normal, following which you will need to run depmod to update modules.
  • After installing the mpss-modules, the next step is to get MPSS from Intel. I chose to use MPSS 3.4.10 since it is closer to the version used in the repo, but it may be possible to use the last release without too much extra work. The RPMs from the Intel packages then need to be converted to the DEB format used by Ubuntu. We use Alien for this task, after which we can go ahead and install the converted packages.
  • Now it is time to finally get and build GCC. First we have to clone the repo, after which some environment variables need to be set and we run the configure script. After that we simply make and install as normal. You may also wish to add the compiler's executable path to PATH in ~/.bashrc in order to be able to call the compiler simply by name.
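The sed step from the list above can be sketched roughly as follows. The function name is made up for illustration, the optional two-letter country mirror prefix (such as ca.archive.ubuntu.com) is an addition beyond the two hosts named above, and the demonstration runs on a scratch file so nothing under /etc is touched. It relies on GNU sed, which is what Ubuntu ships.

```shell
#!/bin/sh
# retarget_apt_sources FILE
# Rewrites the Ubuntu archive and security hosts in FILE so that apt pulls
# from old-releases.ubuntu.com instead. (Hypothetical helper name.)
retarget_apt_sources() {
    sed -i -E 's/([a-z]{2}\.)?(archive|security)\.ubuntu\.com/old-releases.ubuntu.com/g' "$1"
}

# Demonstration on a scratch copy rather than the real sources.list.
demo=$(mktemp)
cat > "$demo" <<'EOF'
deb http://archive.ubuntu.com/ubuntu utopic main restricted
deb http://ca.archive.ubuntu.com/ubuntu utopic-updates main restricted
deb http://security.ubuntu.com/ubuntu utopic-security main restricted
EOF
retarget_apt_sources "$demo"
cat "$demo"
```

To apply it for real, point the function at /etc/apt/sources.list with root privileges, then run apt-get update and apt-get upgrade as usual.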
Once this configuration is done, it should be possible to compile code by simply calling the compiler using k1om-mpss-linux-gcc instead of gcc.
Compiling Code
Having installed a functional compiler, which was a lot more work than I made it seem, I was finally ready to write and compile my own programs. Obviously the first thing to try is compiling a standard helloworld program. This turned out to be simple enough since k1om-mpss-linux-gcc will include the Phi compiled standard libraries automatically. So all I had to do was write a normal C helloworld and change the compiler used to k1om-mpss-linux-gcc. I chose to end my executables with the .mic extension to make them easier to identify, but this is not required.
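A minimal sketch of that first test. It writes an ordinary C hello world into a scratch directory and only invokes the cross compiler if k1om-mpss-linux-gcc is actually on the PATH; the directory and file names are arbitrary.

```shell
#!/bin/sh
# Write a plain C hello world into a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/hello.c" <<'EOF'
#include <stdio.h>

int main(void)
{
    printf("Hello World!\n");
    return 0;
}
EOF

# Cross compile for the Phi if the toolchain from the previous section is
# installed; otherwise just report that it is missing.
if command -v k1om-mpss-linux-gcc >/dev/null 2>&1; then
    k1om-mpss-linux-gcc -o "$workdir/hello.mic" "$workdir/hello.c"
    echo "built $workdir/hello.mic"
else
    echo "k1om-mpss-linux-gcc not found on PATH; skipping compile" >&2
fi
```

The resulting hello.mic will not run on the build machine, since it targets the Phi's instruction set rather than ordinary x86-64.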
Upon running this code after copying it to the Phi using sftp, Hello World! is printed to the terminal. A great success! Having had such excellent success with this compilation, my next goal was to compile and run htop. If you haven't used Linux much, htop is a program that shows system resource utilization through an interactive, text-based interface in the terminal. With 61 cores and 244 threads on this card, running htop would definitely provide an interesting result, so I began by looking into the htop source code. Looking at it, I thought it would be simple enough to compile for my architecture; of course, this was wrong. Unlike my previous compilation, which called gcc directly, htop is built using a configure script, which means that I would need to specify some flags in order to get it to use my cross compilation tools.
The first order of business was not to compile htop, however, since htop has a dependency that first needs to be met in order to run on the Phi. ncurses is a library that allows for better terminal interface programming by letting a developer update the terminal buffer without constantly rewriting the whole thing, as well as supporting character coloring. So in order to run htop on the Phi, I needed to cross compile ncurses. Unfortunately, ncurses is notoriously awkward to cross compile, since part of its build process involves compiling a program that helps with the rest of the compilation. Since this program needs to be able to run on the compiling system, we have to compile it separately first and then tell the build to use that one rather than compile it again.
This compiles ncurses to be compatible with the Xeon Phi and installs all of the library files to /home/phi/k1om/usr. If your username is different or you want this to go somewhere else, just edit the prefix export to match. You can then tar the usr directory that has been generated and copy it over to the Phi in order to use the shared libraries. However, I will be statically linking my build of htop, meaning that this step can be, mostly, skipped. Unfortunately I did have to copy the usr/share folder into the Phi's root filesystem in order to get the terminfo database, as I was unable to get hashing working. This is something I will take a look at later though. With ncurses built correctly, next comes the compilation of htop for the card.
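As a rough sketch of what a cross configure invocation for ncurses can look like: --host, --prefix and --with-build-cc are real Autoconf and ncurses options, but the values here are assumptions, and letting --with-build-cc pick the native compiler for the build-time helpers is one common approach that may differ from the separate helper build described above. The function only prints the command unless RUN=1 is set.

```shell
#!/bin/sh
# Assumed values; adjust to your own setup.
PHI_HOST=k1om-mpss-linux               # prefix of the cross compiler binaries
PREFIX=${PREFIX:-/home/phi/k1om/usr}   # install location used in this post

# cross_configure: print (or, with RUN=1, execute) a configure command that
# targets the Phi while using the native gcc for build-time helper programs.
cross_configure() {
    set -- ./configure "--host=$PHI_HOST" "--prefix=$PREFIX" --with-build-cc=gcc
    if [ "${RUN:-0}" = 1 ]; then
        "$@"
    else
        echo "$*"
    fi
}

cross_configure
```

Running it with RUN=1 from inside the ncurses source tree would execute the command, after which make and make install proceed as usual.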
The resulting htop executable located within the htop source directory can now be copied over to the Phi and executed locally. Firstly however, the terminfo location needs to be set on the card by running the following command after copying the usr/share/ directory built by ncurses.
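ncurses consults the TERMINFO environment variable when looking up terminal descriptions, so this step plausibly amounts to something like the following. The path assumes usr/share was copied to the root of the Phi's filesystem as described; adjust it if yours landed elsewhere.

```shell
#!/bin/sh
# Point ncurses at the terminfo database copied over from the build machine.
# /usr/share/terminfo is an assumed location based on the copy step above.
export TERMINFO=/usr/share/terminfo

if [ -d "$TERMINFO" ]; then
    echo "using terminfo database at $TERMINFO"
else
    echo "warning: $TERMINFO does not exist yet" >&2
fi
```

The variable only lasts for the current shell session, so it has to be set again after each login unless it is added to a profile script that survives reboots.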
Running htop after this will bring up htop showing CPU average only due to the high core count, however with a bit of configuration through the F2 menu, you can show each thread individually as shown here. Here are links to my compiled versions of ncurses and htop.
Expanding Language Availability
With so many successful compilations under my belt, I feel like I am really starting to get the hang of cross compiling. My next order of business is to attempt to expand the options I have for programming the Phi. The toolchain built earlier on is great for compiling C/C++, but doesn't allow for more modern languages. As such I decided that it was time to compile another language to be compatible with the Phi.
Since I'm not insane, I decided to start small; rather than attempt to get a modern language like Go or Rust functional, I have chosen to go with an interpreted language. There are many benefits to this approach, not least of which is the ease of portability for these languages. Since interpreted languages use an interim form that is analyzed at run time, the only part that needs to be modified to function on the card is the interpreter itself. Libraries and other similar tools can be manually cross compiled as well, but as long as the core language functions, this shouldn't be too bad.
For my attempt, I have chosen to start with Python. Python is a popular scripting language that is simple to understand and has easy to grasp syntax. My hope is that by making Python available on this platform, it will open up the ability to work with these cards for average users. Mind you, the current version I have is almost certainly unstable, but is a good stepping stone in getting these cards out of the scrap heap and into the hands of people looking to learn.
I had foolishly thought at first that this would be a simple process, but was quickly proven wrong. The first thing I did was compile my chosen version of Python for my local system to verify my methodology. As soon as I did this, I realized I had a problem. The Python developers have released many iterations of the interpreter over the years, and along the way have implemented... changes. Some of these changes rely on features of more modern versions of GCC, meaning that until someone updates this build chain for a more modern version of GCC, we will be forced to avoid these features or use old sources. In my case, I have chosen to go for the latter to at least show a proof of concept for the idea. As such, my chosen version of Python is version 3.6.1; Python 3.6.x is still fairly well supported, and should be able to run just about anything we could want right now. Having settled on a version to compile, next came the methodology test... again.
Thankfully compilation and installation went fairly smoothly using the host system build chain, meaning now all that was left to do was compile using the Phi build chain, right? Of course, nothing is ever that simple, and in my case I immediately ran into issues. Cross compiling Python, as it turns out, requires a lot more configuration flags than I had at first expected, and as such I had a lot of online sleuthing to do. With all my flags configured, however, I was finally able to start some compilation. That is, until I ran into make errors related to the -lgcov flag. From what I could find online, libgcov is the runtime library that GCC links in to support gcov, its code coverage tool. In my case, it seemed to be plainly missing from my build chain, and when I attempted to locate it on my system it appeared that it was never actually installed to the k1om build area. I also noticed that the only version of it on my system was a static .a library. Thinking that I was missing a dynamic version, I attempted a static compilation of Python, which only caused more headaches. In the end, after sufficient digging, I was able to find out that libgcov is only ever built as a static library, which is used even when linking dynamically.
Having unraveled the mystery of libgcov, the first order of business was to copy the compiled libgcov from my original compilation of k1om-mpss-linux-gcc, located within the install directory, and place it in the k1om build chain directories. With this done, the -lgcov flag was able to successfully link to my version of libgcov and allow my build to succeed. I did also notice at this time that there were compilation issues regarding ncurses, so I added the linking used for the htop build and prayed for the best.
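For reference, a hedged sketch of the kind of configure arguments a CPython 3.6 cross build typically needs. The two triplets are assumptions, and the ac_cv_* cache variables supply answers to device checks that configure cannot run on the build machine; this is not necessarily the exact flag set used here, and it leaves out the install prefix and the ncurses and libgcov linker settings because those paths are specific to each setup.

```shell
#!/bin/sh
# python_cross_args: emit configure arguments for cross compiling CPython 3.6,
# one per line. The triplets are assumed defaults and can be overridden.
python_cross_args() {
    cat <<EOF
--host=${PHI_HOST:-k1om-mpss-linux}
--build=${BUILD_TRIPLET:-x86_64-linux-gnu}
--disable-ipv6
ac_cv_file__dev_ptmx=no
ac_cv_file__dev_ptc=no
EOF
}

# Print the full command; run it from the Python source tree once satisfied.
# A native Python of the same version also needs to be on PATH for the build.
echo "./configure $(python_cross_args | tr '\n' ' ')"
```

Setting ac_cv_file__dev_ptmx to yes instead may be appropriate if that device exists on the card.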
The result of running these commands is an _install directory located at /home/phi/k1om/usr/Python3.6, which can then be tar'd and copied over to the mic the same way as with all the other compiled code. Finally, the binary can be executed after installation and a Python shell should pop up. I have verified that a helloworld and simple math functions work, but have not validated the rest of the functionality, as tests are not run when cross compiling. Here is my compiled version of Python.
New Compiler Accessibility
My previously configured compiler setup was functional, but not ideal. It relies on users configuring their own system in order to compile, and the steps may potentially break in the future. As such I have built a Docker image containing a functional version of the compiler. Simply pull the image and you should be able to compile using the toolchain located within it. I will go into more depth on how I did this for users who are interested in recreating it or who are looking to accomplish something similar.
The general process for preparing this Docker image was fairly similar to the one followed to create the initial compiler toolchain in Ubuntu 14.10. This time, however, it is based on Ubuntu 16.04, with hopes of updating to 20.04 to provide a more modern Docker build. The host system on which the Docker image was generated was also based on 16.04, for ease of development using the specific kernel. Speaking of the kernel, the most recent version of mpss-modules I was able to find that worked well was this one patched for kernel 4.10.11.
My first step was to create a VM running Ubuntu 16.04 as this version is both supported by Docker and is built on a kernel close to the version we are looking to use. Notably however it is not the exact version we are looking for, which means that this needs to be changed manually. This is done by pulling the kernel packages from Canonical and installing them manually, following which the system is reloaded.
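Once the system has been rebooted into the new kernel, a quick check along these lines confirms the running version before anything is built against it. The helper name is illustrative.

```shell
#!/bin/sh
# kernel_is VERSION: succeed if the running kernel release starts with VERSION.
kernel_is() {
    case "$(uname -r)" in
        "$1"*) return 0 ;;
        *)     return 1 ;;
    esac
}

if kernel_is 4.10.11; then
    echo "kernel 4.10.11 is running"
else
    echo "running $(uname -r), expected 4.10.11" >&2
fi
```

The prefix match is deliberate, since distribution kernels usually append a build suffix to the base version.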
Following a reboot, running uname -a will report that the system is running kernel 4.10.11, allowing us to proceed to the next stage. In my case I installed the compiler toolchain locally first to verify that it works, and then built my Docker container afterwards. I will only be listing the steps for getting a working Dockerfile, as an install on 16.04 is done by running the same commands. Firstly, the container needs to be updated and build tools need to be installed. Notably, however, we will not be installing the generic Linux headers, since we have already downloaded them.
Following this, the Docker image needs a copy of the Linux headers installed for its own use. I chose to download them again, but it should be fine to copy the ones downloaded earlier.
Now the steps become very similar to before. We need to install mpss-modules from GitHub; notably, this time the Makefile requires a MIC_CARD_ARCH flag to be set.
Since the version of mpss-modules we pulled is more up to date, we can now use the final release version of MPSS from Intel, 3.8.6, which we clone and install using alien to convert the RPM packages to DEB as before.
Finally, k1om-mpss-linux-gcc needs to be compiled and ~/.bashrc needs to be updated. Since MPSS 3.8.6 is being used, a few path locations need to change in the configuration.
After this, the container can be connected to and all build tools should be accessible, as with the previous configuration. If you prefer, you can download the Dockerfile.
Custom Rackmount Case