Speeding up calculations in .NET with CUDA

For those who have been living under a rock for the last 10 years or so, utilizing the phenomenal parallelism of the graphics processor for some serious number crunching is a big thing these days. For traditional CPU-bound developers, however, the gap between the two worlds is not insignificant. I look forward to the day when standardization has reached the point where this is part of the standard libraries, a tool we simply take out of the toolbox when the job requires it.

In the meantime there has been excellent work done by various groups to bridge that divide. As a .NET developer wanting to get my hands dirty with this exciting technology (albeit somewhat belatedly), the natural choice for me was the excellent CUDAfy.NET framework, licensed under the LGPL. An alternative framework that I have yet to test is Alea GPU. Despite its name, CUDAfy.NET supports both CUDA and OpenCL, the two major standards today (CUDA being proprietary to NVIDIA).

For my test application I went back to an old favorite of mine: rendering the Mandelbrot set fractal. Rendering this fractal is a rewarding task to parallelize since the calculations for every pixel are entirely independent and can be split up as many ways as there are pixels to render. I also created a plugin architecture for the rendering application that dynamically loads different visualizations, be they Mandelbrot, Julia or something else. Two sets of visualization plugins were implemented: one for the CPU and one for the GPU.

Another requirement was that when viewing the Mandelbrot set, the Julia set should update instantly in a small PiP window. This makes sense since the Mandelbrot set is essentially a map of Julia sets: each point on the complex plane corresponds to a unique set of input parameters for the Julia set. To this end I implemented a simple WPF UI, a framework library, and Mandelbrot and Julia fractal plugins for both the CPU and the GPU. Neither the UI nor the framework knows anything about CUDAfy.NET.

To test whether a point belongs to the Mandelbrot set you iterate a complex-valued expression until the value either escapes past a certain limit or a maximum iteration count is reached (initially 256 in my implementation). The actual maximum is determined dynamically in the application by the zoom depth you find yourself at. A full-screen rendering might thus require 1920 x 1152 x 256 = 566 million iterations in the worst case (if no points escape early). When zoomed in deeply, the number of iterations grows with the increased maximum iteration count. It’s a testament to the speed of our current generation of computers that this takes at most a few seconds running on one CPU core. Performing the same number of calculations on the GPU took only 1/20 of that time (1/200 if the calculations are done in single precision instead of double precision).

There are a number of known algorithmic optimizations you can employ when rendering the Mandelbrot fractal, such as recursively testing the edges of rectangular regions: if the entire border is of a uniform color you fill the whole region with that color. This works because the Mandelbrot set is a connected set (there are no islands) and allows us to skip a great number of calculations. I will leave that as a possible future improvement, but a rough sketch of the idea follows below.
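
To make the idea concrete, here is a minimal CPU-side sketch. ComputeLevel is a hypothetical per-pixel callback (essentially MSetLevel further down in this post) and the levels array is laid out row by row; nothing here is part of the actual plugin code.

// Sketch of the border-tracing optimization: if every pixel on the border of a
// rectangle has the same escape level, the interior must share it and can be
// filled without further computation; otherwise split the rectangle and recurse.
static void RenderRegion(int[] levels, int stride, int left, int top, int width, int height, Func<int, int, int> computeLevel)
{
  if (width <= 8 || height <= 8)
  {
    // Too small to be worth subdividing: compute every pixel directly.
    for (var y = top; y < top + height; y++)
      for (var x = left; x < left + width; x++)
        levels[y * stride + x] = computeLevel(x, y);
    return;
  }

  // Compute the border and check whether it is uniform.
  var reference = computeLevel(left, top);
  var uniform = true;
  for (var x = left; x < left + width; x++)
  {
    var t = levels[top * stride + x] = computeLevel(x, top);
    var b = levels[(top + height - 1) * stride + x] = computeLevel(x, top + height - 1);
    if (t != reference || b != reference) uniform = false;
  }
  for (var y = top; y < top + height; y++)
  {
    var l = levels[y * stride + left] = computeLevel(left, y);
    var r = levels[y * stride + left + width - 1] = computeLevel(left + width - 1, y);
    if (l != reference || r != reference) uniform = false;
  }

  if (uniform)
  {
    // A uniform border means a uniform interior (no islands), so just fill it.
    for (var y = top + 1; y < top + height - 1; y++)
      for (var x = left + 1; x < left + width - 1; x++)
        levels[y * stride + x] = reference;
    return;
  }

  // Otherwise split into four quadrants and recurse.
  int halfW = width / 2, halfH = height / 2;
  RenderRegion(levels, stride, left, top, halfW, halfH, computeLevel);
  RenderRegion(levels, stride, left + halfW, top, width - halfW, halfH, computeLevel);
  RenderRegion(levels, stride, left, top + halfH, halfW, height - halfH, computeLevel);
  RenderRegion(levels, stride, left + halfW, top + halfH, width - halfW, height - halfH, computeLevel);
}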

Since initializing CUDAfy.NET uses the same code for both the Julia and Mandelbrot sets (and any future fractal plugins implemented for the GPU), I created an extension method that will “cudafy” the instance on which it’s used.

public static class CudaExtensions
{
  
  private static void LoadTypeModule(GPGPU gpu, Type typeToCudafy)
  {
    var appFolder = AppDomain.CurrentDomain.BaseDirectory;
    var typeModulePath = Path.Combine(appFolder, typeToCudafy.Name + ".cdfy");
    var cudaModule = CudafyModule.TryDeserialize(typeModulePath);
    if (cudaModule == null || !cudaModule.TryVerifyChecksums())
    {
      cudaModule = CudafyTranslator.Cudafy(new[] { typeToCudafy });
      cudaModule.Serialize();
    }
    gpu.LoadModule(cudaModule, false);
  }

  public static void Execute<T>(this T instance, string kernel, int[] levels, byte[] colors, byte[] palette, RegionDefinition definition) where T: IFractal
  {
    CudafyModes.Target = eGPUType.Cuda;
    CudafyModes.DeviceId = 0;
    CudafyTranslator.Language = CudafyModes.Target == eGPUType.OpenCL ? eLanguage.OpenCL : eLanguage.Cuda;
    var deviceCount = CudafyHost.GetDeviceCount(CudafyModes.Target);
    if (deviceCount == 0) return;
    var gpu = CudafyHost.GetDevice(CudafyModes.Target, CudafyModes.DeviceId);
    LoadTypeModule(gpu, instance.GetType());
    var parameters = (instance.Parameters as object).ToDoubleArray();
    var devLevels = gpu.Allocate<int>(levels.Length);
    var devColors = gpu.Allocate<byte>(colors.Length);
    var devPalette = gpu.Allocate<byte>(palette.Length);
    var devParameters = gpu.Allocate<double>(parameters.Length);
    gpu.CopyToDevice(palette, devPalette);
    gpu.CopyToDevice(parameters, devParameters);
    const int gridSide = 128;

    if (definition.Width % gridSide != 0 || definition.Height % gridSide != 0)
    {
      throw new ArgumentException(string.Format("Width and height must be a multiple of {0}", gridSide));
    }

    var blockWidth = definition.Width / gridSide;
    var blockHeight = definition.Height / gridSide;

    gpu.Launch(new dim3(gridSide, gridSide), new dim3(blockWidth, blockHeight), kernel, devLevels, devColors, devPalette, definition.Width, definition.Height, definition.SetLeft,
      definition.SetTop, definition.SetWidth, definition.SetHeight,
      definition.MaxLevels, devParameters);

    gpu.CopyFromDevice(devLevels, levels);
    gpu.CopyFromDevice(devColors, colors);

    gpu.FreeAll();
    gpu.UnloadModules();
  }
}

There are a number of improvements that could be made to the initialization code above, especially dynamic determination of whether to use OpenCL or CUDA, as well as of the best grid and block parameters given the capabilities of the graphics system. Right now it assumes you have a fairly high-end NVIDIA graphics card no more than a couple of years old.
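
For what it’s worth, a rough sketch of what such auto-detection might look like is shown below. The GPGPUProperties members used (MaxThreadsPerBlock in particular) reflect my reading of the CUDAfy.NET API and should be verified against the version you are using; treat this as a starting point rather than tested code.

private static GPGPU SelectDevice()
{
  // Prefer CUDA if an NVIDIA device is present, otherwise fall back to OpenCL.
  var target = CudafyHost.GetDeviceCount(eGPUType.Cuda) > 0 ? eGPUType.Cuda : eGPUType.OpenCL;
  if (CudafyHost.GetDeviceCount(target) == 0)
    throw new InvalidOperationException("No CUDA or OpenCL capable device found.");
  CudafyModes.Target = target;
  CudafyTranslator.Language = target == eGPUType.OpenCL ? eLanguage.OpenCL : eLanguage.Cuda;
  return CudafyHost.GetDevice(target, CudafyModes.DeviceId);
}

private static dim3 ChooseBlockSize(GPGPU gpu)
{
  // Stay well below the device's per-block thread limit; 16 x 16 = 256 threads
  // is a safe choice on most hardware.
  var props = gpu.GetDeviceProperties(true); // true = include advanced properties
  var side = props.MaxThreadsPerBlock >= 256 ? 16 : 8;
  return new dim3(side, side);
}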

The trickiest part of coding for the GPU is that you have to form an inherently parallel mental model. This model is also typically spatial (2- or 3-dimensional grids and blocks), which makes perfect sense for graphics but may require a change of perspective to adapt it to a more abstract problem. The focus on number crunching in arrays also means that you may as well throw out the window whatever object model you had in mind for representing the data; although, given how bloated exceedingly object-oriented architectures tend to get, this is more refreshing than anything else.

At the core are the concepts of a grid and of thread blocks, one per grid cell. Grids and blocks are usually 2-dimensional but can also be 3- or even 1-dimensional. In my limited testing, however, using one-dimensional grids and blocks resulted in horrible performance, and 3-dimensional ones simply didn’t make sense given the nature of the problem (2D image rendering). How you apportion the problem into the blocks and over the grid determines the performance you can expect. In my implementation above I have chosen a two-dimensional grid and two-dimensional thread blocks. The overall grid is 128×128 and each grid cell corresponds to a thread block whose dimensions depend on the size of the image we want to generate. Each kernel instance then has to calculate from its block and thread indices what part of the problem it should work on, i.e. which point and pixel. The fixed grid dimensions simplify things for me but introduce a limitation on the size of images that can be generated (the sides must be a multiple of 128); it’s nothing that can’t be solved with some gentle resizing and cropping in the UI, or with the bounds-check approach sketched below. Optimizing around grids, blocks and threads would be a blog post of its own, and besides, others have done it better already.
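
For reference, here is a sketch of one way to lift that restriction (not what the code in this post does): size the grid from the image with ceiling division and let the extra threads do nothing. On the launch side (inside the Execute method above):

const int blockSide = 16;
var gridDim = new dim3((definition.Width + blockSide - 1) / blockSide,
                       (definition.Height + blockSide - 1) / blockSide); // ceiling division
var blockDim = new dim3(blockSide, blockSide);
// gpu.Launch(gridDim, blockDim, kernel, ...same arguments as before...);

And at the top of the kernel:

var x = thread.blockDim.x * thread.blockIdx.x + thread.threadIdx.x;
var y = thread.blockDim.y * thread.blockIdx.y + thread.threadIdx.y;
if (x >= w || y >= h) return; // threads outside the image simply exit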

You may wonder why I allocate memory on the GPU above for both the levels and the bitmap data they are mapped to. The reason is that the levels contain information that would otherwise be lost in converting them to colors. Suppose, for example, we want to create a 3D scene out of the fractal landscape; then we would need both the levels and the colors.
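
As a small illustration (not part of the plugin code), keeping the levels means you can later re-map them to a different palette, or turn them into a height map, without re-running the kernel:

// Rebuild the BGRA color buffer from the stored escape levels using a new palette,
// mirroring the mapping done in the kernel below.
static void Recolor(int[] levels, byte[] colors, byte[] palette, int maxLevels)
{
  for (var i = 0; i < levels.Length; i++)
  {
    var level = levels[i];
    var colorOffset = i * 4;
    if (level < maxLevels)
    {
      var paletteOffset = level * 3 % palette.Length;
      colors[colorOffset] = palette[paletteOffset + 2];
      colors[colorOffset + 1] = palette[paletteOffset + 1];
      colors[colorOffset + 2] = palette[paletteOffset];
      colors[colorOffset + 3] = 255;
    }
    else
    {
      colors[colorOffset] = 0;
      colors[colorOffset + 1] = 0;
      colors[colorOffset + 2] = 0;
      colors[colorOffset + 3] = 255;
    }
  }
}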

For the actual kernel implementation, CUDAfy.NET uses a [Cudafy] attribute that you apply to the methods you wish it to translate to CUDA or OpenCL compatible code (see below). Care must be taken not to do anything in those methods that cannot be compiled by either compiler. Creating a new array (new int[256]), for example, is a big no-no in the kernel unless you use a special method to allocate shared memory within each block (I have yet to test this). In my example I pass references to global memory arrays into the kernel. Another issue is that .NET developers can sometimes become sloppy with data types; an errant numeric literal can cause problems of varying severity depending on the capabilities of the graphics card and SDK you are using.
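
As an aside on the shared-memory point: CUDAfy exposes per-block shared memory through GThread.AllocateShared, so a kernel wanting a local scratch copy of, say, the palette would look roughly like the sketch below. I have not tested this myself, so take it as an illustration of the shape of the API rather than working code.

[Cudafy]
public static void KernelWithSharedPalette(GThread thread, byte[] palette)
{
  // Shared memory is allocated per block via GThread rather than with "new";
  // the size must be a compile-time constant.
  var shared = thread.AllocateShared<byte>("shared", 768);
  var tid = thread.threadIdx.x + thread.threadIdx.y * thread.blockDim.x;
  if (tid < 768 && tid < palette.Length)
    shared[tid] = palette[tid];  // each thread copies one byte of the palette
  thread.SyncThreads();          // wait until the whole block is done copying
  // ...subsequent palette lookups read from 'shared' instead of global memory...
}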

public class MandelbrotGpu: Mandelbrot
{

  public override string Name => "Mandelbrot (GPU)";

  public override Guid Id => new Guid("ed87ad6e2c984ef0aba5cf00f63b85a2");

  public override Guid? LinkedId => new Guid("bba39b3f89e542cfb13139319c46f10b");

  public override RegionData Generate(RegionDefinition definition, byte[] palette)
  {
    var data = new RegionData(definition);

    this.Execute("MandelbrotKernel", data.Levels, data.Colors, palette, definition);

    return data;
  }

  [Cudafy]
  public static void MandelbrotKernel(GThread thread, int[] levels, byte[] colors, byte[] palette, int w, int h, double sx, double sy, double sw, double sh, int maxLevels, double[] parameters)
  {
    var x = thread.blockDim.x * thread.blockIdx.x + thread.threadIdx.x;
    var y = thread.blockDim.y * thread.blockIdx.y + thread.threadIdx.y;
    var offset = x + y * w;
    var xstep = sw/w;
    var ystep = sh/h;
    var cx = sx + x*xstep;
    var cy = sy + y*ystep;
    var colorOffset = offset * 4;
    var level = MSetLevel(cx, cy, maxLevels);
    levels[offset] = level;

    if (level < maxLevels)
    {
      var paletteOffset = level * 3 % palette.Length;
      colors[colorOffset] = palette[paletteOffset + 2];
      colors[colorOffset + 1] = palette[paletteOffset + 1];
      colors[colorOffset + 2] = palette[paletteOffset];
      colors[colorOffset + 3] = 255;
    }
    else
    {
      colors[colorOffset] = 0;
      colors[colorOffset + 1] = 0;
      colors[colorOffset + 2] = 0;
      colors[colorOffset + 3] = 255;
    }

  }

  [Cudafy]
  public static int MSetLevel(double cr, double ci, int max)
  {
    const double bailout = 4.0;
    var zr = 0.0;
    var zi = 0.0;
    var zrs = 0.0;
    var zis = 0.0;
    var i = 0;
    while (i < max && (zis + zrs) < bailout)
    {
      zi = 2.0 * (zr * zi) + ci;
      zr = (zrs - zis) + cr;
      zis = zi * zi;
      zrs = zr * zr;
      i++;
    }
    return i;
  }
}

Contrast the above GPU implementation to the one for the CPU (they render identical images).

Edit: I decided to give the CPU a fair shot and parallelized the rendering for it too. This resulted in a roughly fourfold performance increase (still at least 5 times slower than the GPU). This is on a single Intel i7 CPU with four cores and hyperthreading. Notice the calculation of MaxDegreeOfParallelism; it’s currently set to the number of logical processors times two (8 x 2 = 16), which seems like a reasonable trade-off.

public class MandelbrotCpu: Mandelbrot
{

  public override string Name => "Mandelbrot (CPU)";

  public override Guid Id => new Guid("9fe96fcd474649c6a6be3472ec794336");

  public override Guid? LinkedId => new Guid("c6d79046fed34ac9a800908b193218ad");

  public override RegionData Generate(RegionDefinition definition, byte[] palette)
  {
    var data = new RegionData(definition);
    var w = definition.Width;
    var h = definition.Height;
    var pixels = w * h;

    var sx = definition.SetWidth / w;
    var sy = definition.SetHeight / h;

    var degree = Environment.ProcessorCount * 2;

    Parallel.For(0, pixels, new ParallelOptions { MaxDegreeOfParallelism = degree }, index =>
    {
      var i = index % w;
      var j = (index - i) / w;
      var x = definition.SetLeft + i * sx;
      var y = definition.SetTop + j * sy;

      // MSetLevel writes the BGRA pixel for this point directly into data.Colors,
      // so no separate coloring step is needed here.
      var level = MSetLevel(x, y, definition.MaxLevels, data.Colors, index * 4, palette);
      data.SetLevel(index, level);
    });

    return data;
  }

  private static int MSetLevel(double cr, double ci, int max, byte[] colors, int colorIndex, byte[] palette)
  {
    const double bailout = 4.0;
    var zr = 0.0;
    var zi = 0.0;
    var zrs = 0.0;
    var zis = 0.0;
    var i = 0;
     
    while(i < max && (zis + zrs) < bailout)
    {
      zi = 2.0*(zr*zi) + ci;
      zr = (zrs - zis) + cr;
      zis = zi * zi;
      zrs = zr * zr;
      i++;
    }

    if (i < max)
    { 
      var paletteIndex = i*3%palette.Length;
      colors[colorIndex] = palette[paletteIndex+2];
      colors[colorIndex+1] = palette[paletteIndex+1];
      colors[colorIndex+2] = palette[paletteIndex];
      colors[colorIndex + 3] = 255;
    }
    else
    {
      colors[colorIndex] = 0;
      colors[colorIndex + 1] = 0;
      colors[colorIndex + 2] = 0;
      colors[colorIndex + 3] = 255;
    }

    return i;
  }
}

There is of course more to the implementation than this – I might put the source up on GitHub if there is enough interest – but it should hopefully be enough to whet your appetite and get you started.

Edit: The source code for this article is now up on GitHub.

The CUDAfy.NET distribution comes with some excellent example applications that will help you on the way as well.

Going forward I’m planning to optimize the grid, block and thread allocations further, to look into using shared memory to minimize access to global memory, and to experiment further with CUDA for simulation and machine learning.

Pushing LightSwitch at VSLive Chicago 2013

VSLive Chicago 2013 Keynote

The day 1 keynote at Visual Studio Live in Chicago was titled “Visual Studio, .NET and the Cloud” and was delivered by Jay Schmelzer (Director of Program Management for the Visual Studio team at Microsoft). Heavy emphasis was put on development with Visual Studio LightSwitch, SharePoint and Office 365.

Visual Studio LightSwitch is an interesting solution for those plain vanilla business applications that really don’t benefit from a fully custom approach. Why reinvent the wheel by constantly writing the same old validation code for things like phone and social security numbers?

At the same time the developer community seems to be almost universally skeptical as to whether it is actually a good idea. There are a number of reasons for this. Some developers are dismissive of the notion that we will ever be able to create complex business applications without writing code. Others object to the notion that non-technical people will be able to use the tool; it may not be development, but it is still quite complex.

Regardless of the possible objections I think this is an interesting option which might become popular for those business applications that don’t really need to follow a custom design or have much in the way of really complex business logic.

In short, even though the idea of developing in Visual Studio LightSwitch doesn’t exactly fill me with warm and fuzzy feelings, I do see a business case here. If you have a different opinion please don’t hesitate to comment.

Visual Studio Live 2013 in Chicago

Chicago

I am currently at the Visual Studio Live conference in Chicago attempting to absorb all the information I can on subjects as diverse as Windows 8 application development, Windows Azure, and Node.js.

The first day is all workshops and I’m attending the “Build a Windows 8 Application in a Day” workshop held by Rockford Lhotka. The term “workshop” is a bit misleading as it is more of an in-depth overview of developing for the Windows App Store whether through C# or JavaScript.

Given that this conference is not arranged by Microsoft, the discussion turns out to be quite candid and open. Windows 8 represents a significant change and is in many areas (such as touch and tablet computing support) clearly an improvement. It is obvious however that the right-brain, left-brain split between the “Metro” and the Desktop sides has caused significant confusion in the developer community.

Once I have collected my thoughts and finalized some snippets of code I have been working on, I will post them here. Interestingly enough I found myself, during the day, drifting over to the JavaScript side of Windows 8 application development rather than the more natural (for me) C# route.

Autonomous Sphero with orbBasic

This is a quick update to my earlier post, with orbBasic code for the Orbotix Sphero that gives it some rudimentary autonomy. Observe that the code comments below should not be sent to the Sphero. I am currently working on a more user-friendly tool for loading orbBasic programs into the Sphero. The undocumented command line tool included in the embryonic framework from my previous post will read an orbBasic file, strip the comments, and transfer it to the Sphero. However, since I only posted source code you will need Visual Studio to compile and run it.

' We set color to green and wait for  
' a small bump to start the program 
' proper.
10 basflg 1
20 RGB 0, 255, 0
30 if accelone < 4000 then goto 30 
' Initializing
40 H = rnd 359
50 D = 1
60 S = rnd 192 + 63
' Set random color
70 RGB rnd 255, rnd 255, rnd 255
' Roll in random direction until 15 
' seconds have passed (to avoid 
' getting stuck after soft collision) 
' or until we hit something.
80 goroll H, S, D
90 timerA = 15000
100 timerB = rnd 1500 + 1500
110 if accelone > 5000 or timerA = 0 then goto 150
120 if timerB > 0 then goto 110
' Every few seconds we randomly 
' adjust our heading somewhat 
' (+/- 15 degrees) and continue.
130 H = (H + rnd 30 - 15) % 360
140 goto 60
' We hit something and perform 
' a hard reverse.
150 H = ((H + 180) % 360)
160 goroll H, 255, 2
170 delay 1000
' Lets take it from the top.
180 goto 60

Balls out fun with the Sphero and .NET

Having received a Sphero robotic ball (made by Orbotix) as a Christmas present from my wife (yes, she is truly amazing), I went into a frenzy of coding over the holidays. Robotics has always interested me but I never got around to experimenting with it before now.

For those of you looking to tinker a bit with robotics I cannot recommend this little gadget highly enough. The Sphero is a hermetically sealed spherical shell of polycarbonate crammed full of robotic goodness. Built-in sensors include an accelerometer, a compass, and a gyroscope. All communication with the Sphero is performed via Bluetooth through the sealed shell, which makes sense as it is intended to be used outdoors and even in water (it floats). The device is thoroughly documented, which allows for some very satisfying hacking activity, something that seems to be encouraged by Orbotix, the company behind this little marvel.

sphero

Driving the Sphero around the living room using a smartphone as a controller is fun and extremely entertaining for adults, offspring and felines alike. Still, for a device promoted as a robot, it started to feel more and more like a radio-controlled ball. A true robot is supposed to scurry around the place and do stuff without me having to frantically wave my phone around. After searching the developer forums for a while I found that there is actually an implementation of BASIC (orbBasic) that will run on the Sphero and allow it to exhibit more autonomy. My first goal became to find a way to upload an orbBasic program to the Sphero from a .NET application and run it.

Sadly the developer SDKs are only available for mobile platforms (namely Android and iOS) for now. Luckily the low-level documentation for the Sphero is excellent and allowed me to ping the device via Bluetooth from a .NET application and receive a reply within an hour or so of tinkering.

After some additional work and poring over the documentation I now have a fairly decent (if hurried) experimental framework, SpheroNET v0.1, up and running. Please note that it IS experimental and is likely to throw exceptions the moment you look at it funny. The solution includes a console application for testing but nothing else.

What was needed:

  1. A Sphero robotic ball.
  2. A way of sending and receiving data over Bluetooth in .NET. For this I used the excellent (and free) 32feet.NET.
  3. The API documentation
  4. orbBasic documentation

First we need to find some devices to connect to.

BluetoothClient client = new BluetoothClient();
List<BluetoothDeviceInfo> devices = new List<BluetoothDeviceInfo>();
devices.AddRange(client.DiscoverDevices());

Once we have retrieved a list of available devices and selected our Sphero from among them we need to connect to it. Note that we have to allow for a number of retries, because the connection process would otherwise, for unknown reasons, frequently fail. A successful connection returns a NetworkStream that we can write to and read from.

private NetworkStream Connect(BluetoothDeviceInfo device, int retries)
{
  BluetoothAddress addr = device.DeviceAddress;
  Guid serviceClass = BluetoothService.SerialPort;
  var ep = new BluetoothEndPoint(addr, serviceClass);
  for (int i = 0; i < retries; i++)
  {
    try
    {
      _client.Connect(ep);
      break;
    }
    catch (Exception ex)
    {
      Thread.Sleep(300);
      if (i == (retries - 1))
        throw new Exception(
        string.Format("Could not connect after {0} retries.", retries), ex);
    }
  }
  NetworkStream stream = _client.GetStream();
  return stream;
}

Data sent to the Sphero needs to be formatted into binary sequences, i.e. packets that it can understand. There are three types of packets: command, response and asynchronous packets. The format for all three is quite similar, which is why I chose to base all packets on a single abstract base class. The primary function of this class is to calculate and update the packet checksum as well as to allow access to the two fields (SOP1 and SOP2) common to all three packet types.

public abstract class SpheroPacket
{
  protected byte[] _data = null;
  public bool IsValid
  {
    get
    {
      return CalculatedChecksum.HasValue ?
       Checksum == CalculatedChecksum.Value :
       false;
    }
  }

  public byte[] Data
  {
    get
    {
      return _data;
    }
  }

  public byte SOP1
  { get { return _data[0]; } }

  public byte SOP2
  { get { return _data[1]; } }

  public byte Checksum
  {
    get { return _data[_data.Length - 1]; }
    set { _data[_data.Length - 1] = value; }
  }

  public byte? CalculatedChecksum
  { get { return GetChecksum(); } }

  public SpheroPacket()
  {
    _data = new Byte[] { 0xFF, 0xFF };
  }

  public void UpdateChecksum()
  {
    byte? checksum = GetChecksum();
    if (checksum.HasValue)
    {
      _data[_data.Length - 1] = checksum.Value;
    }
  }

  public byte? GetChecksum()
  {
    if (_data == null || _data.Length < 4) return null;
    uint sum = 0;
    for (int i = 2; i < _data.Length - 1; i++)
    {
      sum += _data[i];
    }
    return ((Byte)~(sum % 256));
  }

  public override string ToString()
  {
    const string invalid = "[invalid checksum!]->";
    byte[] data = Data;
    StringBuilder sb = new StringBuilder(data.Length * 3);
    if (!IsValid) sb.Append(invalid);
    foreach (var b in data)
    {
      sb.Append(string.Format("{0:X02}", b));
    }
    return sb.ToString();
  }
}

Much of what you might want to do with the Sphero can be accomplished by simply sending properly formatted command packets. The SpheroCommandPacket class below should be able to produce any command described in the API documentation if configured correctly through its constructor.

public class SpheroCommandPacket : SpheroPacket
{
  public byte DeviceId
  { 
    get 
    { 
      return _data[2]; 
    } 
    set
    {
      _data[2] = value; 
      UpdateChecksum(); 
    } 
  }

  public byte CommandId
  { 
    get 
    { 
      return _data[3]; 
    } 
    set
    {
      _data[3] = value;
      UpdateChecksum();
    }
  }

  public byte SequenceNumber
  { 
    get
    {
      return _data[4];
    }
    set
    {
      _data[4] = value;
      UpdateChecksum();
    }
  }

  public byte DataLength
  { 
    get
    {
      return _data[5];
    }
    set
    { 
      _data[5] = value;
      UpdateChecksum();
    }
  }

  public SpheroCommandPacket(
    byte deviceId, byte commandId, 
    byte sequenceNumber, byte[] data): base()
  {
    List<byte> list = new List<byte>();
    list.AddRange(_data);
    list.AddRange(new Byte[] { deviceId, commandId, sequenceNumber });
    if (data != null)
    {
      list.Add((byte)(data.Length + 1));
      list.AddRange(data);
    }
    else
    {
      list.Add(0x01);
    }
    list.Add(0xFF); // Placeholder for checksum
    _data = list.ToArray();
    UpdateChecksum();
  }
}

Obtaining properly formatted command packets is a simple matter of implementing a factory class for that purpose.

public static SpheroCommandPacket EraseOrbBasicStorage(StorageArea area)
{
  return new SpheroCommandPacket(0x02, 0x60, 0x01, new byte[] { (byte)area });
}

public static SpheroCommandPacket AppendOrbBasicFragment(StorageArea area, string fragment)
{
  List<byte> data = new List<byte>();
  byte[] fragBytes = Encoding.Default.GetBytes(fragment);
  data.Add((byte)area);
  data.AddRange(fragBytes);
  return new SpheroCommandPacket(0x02, 0x61, 0x01, data.ToArray());
}

public static SpheroCommandPacket ExecuteOrbBasicProgram(StorageArea area, UInt16 fromLine)
{
  byte[] data = new byte[3];
  data[0] = (byte)area;
  data[1] = (byte)((fromLine & 0xFF00) >> 8);
  data[2] = (byte)(fromLine & 0x00FF);
  return new SpheroCommandPacket(0x02, 0x62, 0x01, data);
}

public static SpheroCommandPacket AbortOrbBasicProgram()
{
  return new SpheroCommandPacket(0x02, 0x63, 0x01, null);
}

Below we send the assembled packet over the NetworkStream. Notice how we also assign a sequence number (one byte) to the packet. The sequence number is echoed in any resulting response packets, which will hopefully allow you to figure out which response belongs to which command. How you go about relating asynchronous packets from the Sphero to the commands that triggered them is another question for which I have no definitive answer at the moment. Asynchronous packets may be sent as a result of commands or as a result of some internal trigger in the Sphero (for example just before it goes to sleep).

public void SendPacket(SpheroCommandPacket packet)
{
  packet.SequenceNumber = GetNextSequenceNumber();
  Stream.Write(packet.Data, 0, packet.Data.Length);
}
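
GetNextSequenceNumber is not shown above; a minimal version is just a wrapping byte counter, along these lines:

private byte _sequenceNumber;

// Simple wrapping counter; the value comes back in the matching response packet.
private byte GetNextSequenceNumber()
{
  unchecked { _sequenceNumber++; }
  return _sequenceNumber;
}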

Sending an orbBasic program to the Sphero is now quite straightforward. The program below, once sent and executed, will make the Sphero glow a steady green as long as it is not disturbed. Once a preset acceleration threshold is reached it will pulse red for about 10 seconds. This is, admittedly, not a very exciting program but it is a good start.

10 RGB 0, 255, 0
20 if accelone > 1700 then goto 40
30 goto 20
40 for J = 1 to 8
50 for I = 0 to 255 step 10
60 RGB I, 0, 0
70 delay 50
80 next I
90 next J
100 goto 10
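
Behind the scenes, getting a program like the one above onto the ball amounts to erasing the chosen storage area, appending the source in fragments and then executing it. A rough sketch using the factory methods shown earlier (the factory class name, the chunk size and the lack of response handling are simplifications on my part):

// Upload an orbBasic program to the Sphero and run it from line 10.
public void UploadAndRun(string program, StorageArea area)
{
  SendPacket(SpheroCommandPacketFactory.EraseOrbBasicStorage(area));

  const int chunkSize = 32; // conservative guess at a safe fragment size
  for (var offset = 0; offset < program.Length; offset += chunkSize)
  {
    var length = Math.Min(chunkSize, program.Length - offset);
    SendPacket(SpheroCommandPacketFactory.AppendOrbBasicFragment(area, program.Substring(offset, length)));
  }

  SendPacket(SpheroCommandPacketFactory.ExecuteOrbBasicProgram(area, 10));
}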

For a fuller picture of how you go about receiving and classifying incoming packets (response or async), please refer to the full SpheroNET source code.

For the future I have a number of improvements planned. Firstly, I will spend some more time writing orbBasic code to make the Sphero behave like the robot it is, much to the delight, I would guess, of children and cats alike. Secondly, all the commands listed in the API documentation should be implemented. An intuitive way of relating sent command packets to their resulting response packets (and asynchronous packets where possible) should also be found.

As always any feedback is welcome.

Raspbmc RC4 on Raspberry Pi

I installed the latest release candidate of Raspbmc (RC4) on my Raspberry Pi during the weekend and have been enjoying flawless video playback ever since.

Installation, as seems usual with the Raspberry Pi (and highly unusual for most Linux flavors), was a breeze. Just download the installer for your OS and follow the instructions on the site. The installer downloads a minimal OS image and writes it to the selected SD card. The next step is simply to insert the card into the Raspberry Pi and power it on, at which point it proceeds to download and install the rest of the system directly from a Raspbmc download server (the Pi obviously needs to be connected to the internet at this point). The most arduous part was making a cup of coffee to enjoy while this was going on (which took about 25 minutes).

Raspbmc in action (UI after startup).

This new RC is based on Raspbian and has support for hardware floating point. Even though video playback was OK in RC3, UI interaction would become sluggish while playing video. Thankfully this is no longer an issue. Assigning the Raspberry Pi to media center duty has the added benefit of no longer having to put up with the power-guzzling, jet-engine sound level of the Xbox 360. In contrast, the Raspberry Pi runs perfectly silently from a 5 volt generic cell phone charger (at 700 mA). After thinking for a while about where to place it, I finally decided to just dump it into the cable mess behind the TV bench. It is so tiny it is hardly noticed there anyway (even cased). Interaction is via a wireless keyboard and mouse for the moment, although I might investigate the possibility of hooking it up to our remote. I highly recommend the Logitech Wireless Combo MK260 (check out this site for a list of verified peripherals).

In short, for those of you interested in a low-cost media center computer, choosing the Raspberry Pi is a no-brainer.

Curiosity has landed!

I have to say I’m impressed. When I first saw the landing procedure intended for the Curiosity rover headed for Gale crater on Mars I was a bit skeptical. The extremely complicated landing sequence employed for the Curiosity mission seemed like asking for trouble, given that NASA has a bit of a spotty record when it comes to delivering hardware to Mars.

The Curiosity rover.

The Curiosity rover was to be lowered to the surface of Mars on wires suspended from a descent stage autonomously balancing on thrusters. Some may remember the Mars Climate Orbiter metric vs. English unit mix-up, so a bit of scepticism didn’t seem misplaced. Adding more potential points of failure would seem to increase the odds of something unforeseen ruining the party.

As it turned out, the landing went off without a hitch. Big kudos to everyone involved in this project; they accomplished something truly amazing.

Waiting several months for my Raspberry Pi to arrive meant that I had plenty of time to come up with potential uses for it. The Raspberry Pi caught my attention because it seems to be one of those subversive technologies that come in under the radar and change everything. Nothing about it is truly revolutionary, yet it delivers fully fledged computing with virtually the entire GNU/Linux software catalog (and others) behind it, at an unbeatable price point (roughly $35 for a bare model B board). The possibilities are virtually endless. The Raspberry Pi and its successors will be used for robotics, sensors, DIY drones, multimedia centers, and home servers, to name but a few. As it is an open hardware design we will probably see clones coming out of China by the shipload. In some ways it also seems to fit in nicely with the whole wimpy-core trend for data centers (which I admit is not uncontroversial).

This is what arrived in the mail last week:

Raspberry Pi on arrival.

My own plans for the device are slightly more modest though. Personally I’m interested in using these devices as an always-on home server and as an ultra-small-footprint media center (if this particular use pans out); for work I would like to investigate using them to drive the wall screens that are proliferating all over the office.

Source code management (running Git)

I have avoided having a server running 24/7 at home for a number of reasons. They take up space, are noisy, cost money to run (admittedly not a huge deal), and cost a not insignificant amount of money to buy. The Raspberry Pi negates all these objections handily. It takes no space, makes zero noise, pulls almost no power (idle or otherwise), and comes in under $100 with all accessories.

Replacement for the Xbox 360 as a media center

I haven’t tried this yet but it looks promising. If the codec support is there via XBMC and the performance is up to snuff it should handily beat the Xbox.

Project dashboard for work

At Active Solution, where I work, we have begun installing wall screens to display project-relevant information such as Trello boards (digital post-it replacement) and information piped to web-based dashboards from our various issue tracking systems (depending on the client). Currently we have dedicated laptops serving each wall screen, which was workable when there were only two. As the number of wall screens grows this approach is quickly becoming untenable. I decided to try the Raspberry Pi out as a replacement for all those power-hungry laptops.

Setup

The Raspberry Pi comes with nothing, not even a case, included (except a nice t-shirt) so you will need to get a couple of things in order to get started.

  • 4GB+ SDHC card (Class 4)
  • Micro usb charger/power supply (minimum 700 mA)
  • Mouse
  • Keyboard
  • HDMI cable
  • Screen (with HDMI input)
  • Optional: a case (I just ordered one from ModMyPi)

I will not go into great detail about how you get the Raspberry Pi up and running except to say that it was a breeze (check out the Raspberry Pi Foundation quick start guide). The Debian Squeeze Linux image I opted for was ready to be written directly to the SD card (no installation needed). All you need to do is change the default password on first boot.

Raspberry Pi up and running

It took literally no more than 20 minutes to download the Linux image, write it to the SD card, assemble the Pi and hook it up to the monitor. The only snag at this point was that we had to enable a previously disabled network socket close to the intended screen. A bigger problem turned out to be the supported browsers on the Pi. As of this writing none of the browser versions available (of Midori and Chromium) supports HTML5. This is a problem for us as the dashboard software we are using (our own as well as Trello) requires HTML5 support.

Raspberry Pi on wall screen

Raspberry Pi (with Debian Squeeze and LXDE desktop) on wall mounted flatscreen.

The solution

Not one to be discouraged, I soon had an idea for how to get around this limitation. Even though we might conceivably upgrade to touch screens some time in the future, our wall screen dashboards are non-interactive and simply display a pre-set web page (refreshing periodically). I recently attended an in-house seminar conducted by my eminent co-workers Chris Klug and Robert Folkesson on, among other things, the open-source HTTP service framework NancyFx. Here was a lightweight solution that could be used to serve images of web pages, rendered in a browser that supports HTML5, to a wall-screen-connected Raspberry Pi. It was of course also a great opportunity to learn more about NancyFx.

WebSnap

I began outlining a web page snapshot service named WebSnap and soon had a prototype up and running. As our company specializes in Microsoft technology this was implemented as a WPF application in Visual Studio 2010. At the moment of writing it is functional, albeit not quite “done”. I will outline the general idea here and plan to post the full source once it has matured somewhat. For simplicity (most certainly not for scalability) I opted for simply capturing screenshots of pages rendered in a WebBrowser control hosted in a WPF application. The ideal would of course be to render pages fully off-screen, but for a simple proof of concept this seemed a bit excessive. If anyone has a suggestion for solving this particular problem, don’t hesitate to share it.

To include NancyFx (self hosting) in a Visual Studio 2010 project using the Nuget PM:

PM> Install-Package Nancy
PM> Install-Package Nancy.Hosting.Self

Starting self-hosting is as simple as:

NancyHost host = new NancyHost(new Uri("http://localhost"));
host.Start();
...
host.Stop();

All you really need in order to start serving data over HTTP is to implement a NancyModule. Here I leaned heavily on an excellent blog post by Andre Broers on self-hosting Nancy and utilizing the SuperSimpleViewEngine included with NancyFx. The full source includes view models, views and various support classes which I have not included in this post. Please refer to Andre Broers’ excellent post for an overview of how to do this.

SnapModule here inherits from BaseModule, which in turn inherits from NancyModule. As you can see, we are setting up a number of routes with different “handlers” attached.

using System;
using System.Drawing.Imaging;
using System.IO;
using System.Text;
using System.Threading;
using System.Windows;
using Nancy;
using Nancy.Responses;
using WebSnap.Models;

namespace WebSnap.Modules
{
 public class SnapModule : BaseModule
 {

  delegate void UIDelegate();

  public SnapModule()
  {
   Get["/content/{url}"] = x => Response.AsFile((string)x.url);

   Get["/load/{id}"] = x => { return Load(x); };

   Get["/images/{url}"] = x => Response.AsFile(".\\Images\\" + (string)x.url);

   Get["/"] = x =>
    {
     Model.LandingPage = new LandingPageModel();
     Model.LandingPage.Heading = "WebSnap";
     return View["LandingPage", Model];
    };

   Post["/"] = x =>
    {
     string url = (string)Request.Form.Url;
     string res = (string)Request.Form.Res;
     string hex = StringToHexString(url + "@" + res);
     return Response.AsRedirect(string.Format("/load/{0}", hex));
    };

  }

  private dynamic Load(dynamic p)
  {
   string hex = (string)p.id;
   string query = HexStringToString(hex);
   string[] parts = query.Split('@');
   string siteUrl = parts[0];
   string resolution = parts[1];
   string imageUrl = string.Format("/images/{0}.png", hex);
   string imagePath = string.Format(".\\Images\\{0}.png", hex);
   Model.SnapPage = new SnapPageModel();
   Model.SnapPage.Image = imageUrl;
   string[] dimensions = resolution.Split('x');
   Size size = new Size(double.Parse(dimensions[0]),
    double.Parse(dimensions[1]));
   // Do we need to refresh the image?
   TimeSpan imageAge = File.Exists(imagePath) ?
   GetFileAge(imagePath) : TimeSpan.MaxValue;
   if (imageAge > TimeSpan.FromMinutes(5))
   {
    // Set up delegate to navigate to page
    // in the web browser control...
    UIDelegate navigateToPage = delegate()
    {
     UI.Instance.Resize(size);
     UI.Instance.LoadCompleted = false;
     UI.Instance.WebBrowser.Navigate(siteUrl);
    };
    // and one to capture a screen shot of
    // the rendered page.
    UIDelegate captureImage = delegate()
    {
     Capture.Window(imagePath, ImageFormat.Png, UI.Instance.WebBrowser);
    };

    // Load the page in the web browser.
    Application.Current.Dispatcher.Invoke(navigateToPage);

    // Wait for the website to load on the UI thread (max 10 sec).
    WaitForLoad(TimeSpan.FromSeconds(10));

    // Capture image from the web browser.
    Application.Current.Dispatcher.Invoke(captureImage);
   }
   return View["SnapPage", Model];
  }

  private static void WaitForLoad(TimeSpan max)
  {
   DateTime start = DateTime.Now;
   do {
    if (UI.Instance.LoadCompleted) break;
    Thread.Sleep(100);
   } while ((DateTime.Now - start) < max);
  }
  ...
 }
}

The reason for marshalling all interaction with the WebBrowser control via the dispatcher is that the control is owned by the UI thread and will not take kindly to being manipulated from another thread. The NancyFx module is not going to be running on the UI thread, hence the marshalling. This of course creates a bottle (pain in) neck when we try to scale the solution up. Luckily only so many wall screens will fit in the office.

When you first navigate to the host address you will be presented with a landing page where you can enter the page to snapshot and the size of the snapshot (at the moment this is approximate).

Landing page

The landing page

When you press Load the server will be instructed to load the requested page into the hosted web browser (which supports HTML5) and will reply with a page that renders the captured image at full window size, centered. If the same client (or another) requests the same page at the same resolution again, the same image will be returned unless it is older than 5 minutes, in which case a new one will be captured. The end of the URL here is a hexadecimal string created by simply transforming http://www.google.com@800x600 into 687474703a2f2f7777772e676f6f676c652e636f6d4038303078363030 (each ASCII character becomes two hexadecimal digits). Crude but effective (if not exactly memorable).
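
The StringToHexString and HexStringToString helpers referenced in SnapModule are not shown in this post; they do nothing more exotic than that conversion, roughly:

// Encode each character as two hexadecimal digits and back (ASCII assumed).
private static string StringToHexString(string value)
{
  var sb = new StringBuilder(value.Length * 2);
  foreach (var b in Encoding.ASCII.GetBytes(value))
    sb.AppendFormat("{0:x2}", b);
  return sb.ToString();
}

private static string HexStringToString(string hex)
{
  var bytes = new byte[hex.Length / 2];
  for (var i = 0; i < bytes.Length; i++)
    bytes[i] = Convert.ToByte(hex.Substring(i * 2, 2), 16);
  return Encoding.ASCII.GetString(bytes);
}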

WebSnap

Snapshot of webpage loaded from WebSnap

At this point this is essentially a somewhat ugly hack. No consideration has been given to how it is supposed to handle increased load or whether or not it is secure. If it is used at all, it will be on an internal network for the sole purpose of feeding a handful of wall screens with web page snapshots.

With this I conclude this report on my initial adventures with the Raspberry Pi. I hope to post more on this subject as my experience with the device grows and my various projects mature.

Raspberry Pi + Flatscreen + Nancy = Dashboard