
The tech behind OSCC13

As many of you know, early last September we held the first OpenSimulator Community Conference [1], a purely online event hosted in OpenSimulator itself. It was produced collaboratively by the Overte Foundation [2] and Avacon [3], with the planning process beginning in February 2013 and increasing in intensity right up until the conference itself.

It consisted of 23 separate regions, with a keynote area, breakout session areas, landing zones, a sponsor expo area and a staff area. Planned capacity for the conference was 220 avatars, though avatar numbers in the expo zone were unrestricted. The conference was financed through volunteer effort (time is money!) and sponsorships. Conference registration itself was free.

The event was a great success – everybody that I’ve heard from, whether attendee, volunteer, speaker or sponsor, was very positive about it.

A keynote region at the OpenSimulator Community Conference. (Image courtesy Justin Clark-Casey.) [4]

In this series of blog posts, I’m going to talk about the technical, organizational and human sides of the conference, as well as some thoughts about virtual conferences in general. I’m quite well placed to do this, as I was both the co-chair of the conference and a major part of the technical effort to provide a stable and high-performance OpenSimulator installation.

First up is the technical side. I hope this will be of general interest and also useful to anybody who is thinking of hosting a similar event in the future. So please feel free to ask any technical questions in the comments.

The Hardware

In this particular post I’m going to talk about the hardware that we used in the conference. All 23 regions were hosted on a single 24-core Intel Xeon X5650 machine with 64G of RAM. This machine was hosted on a high-bandwidth, low-latency network (not a home network).

There’s an approximate rule of thumb [5] which says that a machine running OpenSimulator should have one CPU core per region simulated. However, final performance in the conference was very good – it’s likely that we could have managed with considerably fewer cores. It’s very difficult to determine an exact ratio, however, without extremely time-consuming performance testing. It’s also heavily dependent on the number of objects in regions, the number and complexity of scripts, the number of avatars in the region, etc.

We can say that we had more memory than required. Maximum memory use during the conference was approximately 16G, about 700M per region. This was with a separate OpenSimulator instance per region – hosting multiple regions in a single instance would use less memory, though the difference may not be very large. So a good safe rule of thumb is to allocate 1G per region. Each region was hosted in its own simulator to provide fault isolation between regions — if the Mono runtime failed in one simulator it would only bring down that region.

In terms of network capacity, we were never going to have any issues. From analysing network data, we can say that an approximate rule of thumb is to have 500 kbit/s of download bandwidth from the server available for each connected avatar and 50 kbit/s of upload. More details are on the OpenSimulator wiki [6].
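Putting those rules of thumb together, a quick back-of-the-envelope calculation for an event of this size might look something like the following sketch (purely illustrative, using only the figures quoted above; not part of OpenSimulator):

    // Back-of-the-envelope capacity estimate using the rules of thumb quoted above.
    // All figures are illustrative; real requirements depend heavily on content,
    // scripts and avatar numbers.
    using System;

    class CapacityEstimate
    {
        static void Main()
        {
            int regions = 23;    // conference regions
            int avatars = 220;   // planned avatar capacity

            int cores = regions;   // ~1 core per region (a generous allocation)
            int ramGB = regions;   // ~1 GB per region is a safe allocation

            // 500 kbit/s downloaded from the server per avatar, 50 kbit/s uploaded
            double serverOutMbit = avatars * 500 / 1000.0;
            double serverInMbit = avatars * 50 / 1000.0;

            Console.WriteLine("Cores: {0}, RAM: {1} GB, out: {2} Mbit/s, in: {3} Mbit/s",
                cores, ramGB, serverOutMbit, serverInMbit);
        }
    }

For OSCC13 that works out to roughly 23 cores, 23 GB of RAM and on the order of 110 Mbit/s of outbound bandwidth – comfortably within what the machine and its network could provide, and, as noted above, rather more cores and memory than we actually needed.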

Using a single machine for the conference was convenient for administration purposes. It also eliminated one potential source of teleport issues, as communication between source and destination simulator was internal to a single machine. However, it also made the conference vulnerable to a failure in that single machine. To counter this, one can either have a duplicate machine available with a copy of the entire conference installation (as we did) or spread the regions and grid services out amongst multiple machines to reduce the consequence of failure in any single machine (though one is still vulnerable to a database server failure).

One could also host duplicate or multiple simulators at different physical locations in case the network at a particular location became unavailable for a certain period of time. But the thing to bear in mind is that all these choices are tradeoffs – greater redundancy involves greater cost and operational work, and in some cases can even decrease reliability (e.g. teleports between different locations not on the same LAN). Physical venues face comparable reliability tradeoffs.

For instance, a physical conference centre may have a fire drill which puts everything out of action for a certain period. Or a volcano in Iceland may start spewing flight-halting volcanic ash [7], as happened with the Metameets 2010 conference.

Early Struggles

In virtual environments that run the Second Life protocol, a large event is often held over a four-region area in order to spread the processing load over four independent simulator instances rather than one, or even over multiple machines rather than a single server.

For the OpenSimulator Community Conference, every region would run on the same machine. However, there was still value in spreading users over multiple regions. Although an OpenSimulator instance launches threads with extreme enthusiasm for all sorts of different tasks, there are still a number of single-threaded processes that can potentially act as bottlenecks. For example, there’s a single thread that processes incoming UDP messages from viewers, one to send them back out to viewers and another to co-ordinate aspects of the scene itself, such as physics and avatar movements. There is room for improvement here (for instance, there has already been experimentation to process physics on a separate thread) but such work is highly complex. At this point, it’s much easier to spread the load between different regions.
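To make the bottleneck shape concrete, here is a highly simplified sketch (not the actual OpenSimulator code) of a single-threaded inbound packet pump – however many cores the machine has, everything queued here is drained by one thread:

    // Simplified sketch of a single-threaded inbound packet pump.
    // Illustrative only; it just shows why one consumer thread caps throughput
    // regardless of how many cores are available.
    using System.Collections.Concurrent;
    using System.Threading;

    class InboundPacketPump
    {
        private readonly BlockingCollection<byte[]> _inbox = new BlockingCollection<byte[]>();

        // Many receiver threads can enqueue packets...
        public void Enqueue(byte[] packet)
        {
            _inbox.Add(packet);
        }

        // ...but a single worker thread dequeues and processes them, so total
        // throughput is limited by what one core can handle.
        public void Start()
        {
            var worker = new Thread(() =>
            {
                foreach (byte[] packet in _inbox.GetConsumingEnumerable())
                    Process(packet);
            });
            worker.IsBackground = true;
            worker.Start();
        }

        private void Process(byte[] packet)
        {
            // decode and dispatch the message (omitted)
        }
    }

Adding a second region effectively adds a second pump of this kind, which is why spreading avatars over several regions helps even when everything runs on one machine.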

To further ease performance issues, we also prevented avatars from crossing between the regions, as region crossing is currently a heavyweight process and not always reliable, especially in situations where source and destination regions are highly loaded. We also instituted a scheme so that most conference attendees could only enter the keynote region to which they were assigned, partly in order to eliminate any extra load generated by users teleporting between them.

Even with all these measures, we were really struggling with performance in the beginning. Getting 80 avatars into the keynote regions sent CPU load skyrocketing, straining even our 24-core system. There was a real worry on my part that we would have to shard the keynotes (i.e. have two identical copies of the regions and relay the presentations from one region to another). Understandably, nobody was enthusiastic about that – it would have been a real ding on the sense of everybody being in a single virtual place, as well as causing some significant organizational difficulties.

So to tackle these performance issues we instituted weekly load tests from May right up until the conference itself. Anybody was invited to come along and help stress test the environment. Because the infrastructure to process avatar registrations was not yet in place, most people entered the regions via the hypergrid [8].

Find, Fix, Stress

Over these weeks, there were three major activities that we had to carry out in order to improve performance. Firstly, we had to find the performance bottlenecks. Secondly, the performance improvements themselves had to be devised, debugged, implemented and then tested under load. Lastly, we had to extend existing test tools to create a suitable synthetic bot load on the system.

I’m going to say a little bit about each of these things in turn.

Find

Very broadly speaking, there are two kinds of bugs. Firstly, there are the bugs suffered by a single user with a set of steps that will reproduce the problem every single time. These are not necessarily simple, but at least the developer can recreate them and sooner or later pin them down to a particular place in the code.

Then there are the bugs which only occur under certain conditions such as heavy user load, unanticipated combinations of client behaviour or unpredictable network response times. In this case, it’s often obvious to the user when there is a problem (e.g. my avatar keeps freezing) but often not at all obvious why that problem is occurring. Moreover, these problems are often extremely difficult to recreate outside of that particular combination of events.

It was the second kind of bug which really challenged us on the technical side for the conference. You can get some traction on such issues with an expert knowledge of the system, and many fixes were performed this way, particularly as we had the opportunity week by week to observe the effects of changes.

But it was also necessary to start measuring many new internal statistics (e.g. number of inbound UDP messages received per second, number of messages waiting to be handled by the system, number of different UDP messages sent by each connection). This is the kind of data that splashes out if you run the command “show stats all” on the simulator console. There is also an experimental feature to record statistical information every 5 seconds for later analysis (“debug stats record start|stop”).
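Both of these are run directly at the simulator console. For example, around a load test:

    show stats all              (print all the current statistics once)
    debug stats record start    (begin recording statistics every 5 seconds)
    debug stats record stop     (stop recording, leaving the data for later analysis)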

This extra information helped us work out which aspects of the system were associated with performance problems and get a better grasp on system behaviour in general. However, even now, it’s still the case that much of this information is probably very difficult to interpret without a deep knowledge of the underlying mechanisms.

Fix

Over the course of five months of load tests, we made many changes to OpenSimulator. These changes addressed both raw performance issues (e.g. handling more avatars per region) and issues that appeared only under heavy load (e.g. mesh sometimes not being received by avatars teleporting in when a large number of other people were already connected).

One issue in particular was the handling of incoming avatar movement messages. Most viewers (clients) connected to OpenSimulator will send through a constant stream of AgentUpdate UDP packets, approximately 10 every second. These transmit changes to the avatar’s body and head rotation, camera position, etc.

Many of these packets are identical or contain only very small changes (e.g. the avatar head rotation has changed by a fraction of a fraction of a degree). OpenSimulator was already discarding identical packets but only at a fairly late stage, and it was always processing packets where the changes were tiny compared to the last processed packet.

Hence, we started discarding packets at a much earlier stage, both those which were identical and those where the change from the last packet was so small that it was insignificant. This radically improved performance – we went from 80 avatars consuming more than half the available cycles of our 24 CPUs to those same 80 connections barely taking up 1 CPU.
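In outline, the filtering looks something like the following sketch (the type, field names and tolerance values here are illustrative stand-ins, not the actual OpenSimulator implementation):

    // Sketch of early AgentUpdate filtering: drop packets that are identical to,
    // or only insignificantly different from, the last packet we processed.
    // AgentUpdateArgs and the thresholds below are illustrative only.
    using System;

    struct AgentUpdateArgs
    {
        public float HeadRotW, HeadRotX, HeadRotY, HeadRotZ;
        public float CamX, CamY, CamZ;
        public uint ControlFlags;
    }

    class AgentUpdateFilter
    {
        private AgentUpdateArgs _last;
        private bool _hasLast;

        // Changes smaller than these are treated as noise and dropped.
        private const float RotTolerance = 0.001f;
        private const float CamTolerance = 0.05f;

        public bool ShouldProcess(AgentUpdateArgs update)
        {
            if (!_hasLast || IsSignificant(update, _last))
            {
                _last = update;
                _hasLast = true;
                return true;   // hand on to the rest of the pipeline
            }

            return false;      // discard before any expensive processing happens
        }

        private static bool IsSignificant(AgentUpdateArgs a, AgentUpdateArgs b)
        {
            return a.ControlFlags != b.ControlFlags
                || Math.Abs(a.HeadRotW - b.HeadRotW) > RotTolerance
                || Math.Abs(a.HeadRotX - b.HeadRotX) > RotTolerance
                || Math.Abs(a.HeadRotY - b.HeadRotY) > RotTolerance
                || Math.Abs(a.HeadRotZ - b.HeadRotZ) > RotTolerance
                || Math.Abs(a.CamX - b.CamX) > CamTolerance
                || Math.Abs(a.CamY - b.CamY) > CamTolerance
                || Math.Abs(a.CamZ - b.CamZ) > CamTolerance;
        }
    }

The key point is that the check is cheap and happens before any further work is done on the packet, so the many near-duplicate updates never touch the expensive parts of the pipeline.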

This was the point at which we knew our server would be able to handle the planned conference load and it was a big relief! It also goes to show that in open-source, there’s nothing quite like making yourself “eat your own dogfood” – we had committed to put on a conference in OpenSimulator and so were highly motivated to spend the enormous time and effort necessary to get performance to where it needed to be.

Stress

Having people come in week by week to stress test our changes was invaluable. There’s absolutely no substitute for having real people connecting to the simulator using all sorts of different networks to build confidence that everything was going to work in the conference itself.

However, even with a fixed time for the tests and week by week publicity, we couldn’t get anywhere near enough real connections to match our planned 220 avatar target.

Therefore, we had to turn to a synthetic load, both to supplement real connections at load tests and to allow individual developers to at least approximate a high load when few real people were available.

We already had a test tool bundled with OpenSimulator called pCampbot, which creates a number of libopenmetaverse external client connections to stress test various aspects of the simulator (e.g. you can make such bots continuously teleport around until a failure does (or doesn’t) happen).

However, the existing pCampbot code was very awkward to use in conjunction with real connections and in a situation where bots would have to be added and removed over a number of regions. Hence, we made a number of enhancements to this tool both to make it easier to manage bot connections and to introduce new types of behaviour (e.g. get all bots to occupy a sit target).
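In rough outline, such a tool has the following shape (a heavily simplified sketch with hypothetical types; the real pCampbot builds on libopenmetaverse and offers many more behaviours and options):

    // Very rough sketch of a bot load generator: N connections, each running a
    // pluggable behaviour in a loop. IBot, IBotBehaviour and BotSwarm are
    // hypothetical stand-ins, not the real pCampbot/libopenmetaverse classes.
    using System;
    using System.Collections.Generic;
    using System.Threading;

    interface IBot
    {
        void Login(string loginUri, string firstName, string lastName, string password);
        void Logout();
    }

    interface IBotBehaviour
    {
        void Act(IBot bot);   // one step of behaviour, e.g. wander, teleport or sit
    }

    class BotSwarm
    {
        private readonly List<Thread> _threads = new List<Thread>();
        private volatile bool _running = true;

        public void Start(int botCount, string loginUri,
                          Func<IBot> botFactory, IBotBehaviour behaviour)
        {
            for (int i = 0; i < botCount; i++)
            {
                int index = i;
                var thread = new Thread(() =>
                {
                    IBot bot = botFactory();
                    bot.Login(loginUri, "LoadBot", index.ToString(), "password");

                    while (_running)
                    {
                        behaviour.Act(bot);   // e.g. occupy a sit target
                        Thread.Sleep(1000);
                    }

                    bot.Logout();
                });
                _threads.Add(thread);
                thread.Start();
            }
        }

        public void Stop()
        {
            _running = false;
            foreach (Thread t in _threads)
                t.Join();
        }
    }

The enhancements we made were essentially about this kind of manageability: being able to add and remove batches of bots across several regions while real people were also connected, and being able to swap in new behaviours (such as the sit-target one) without disturbing the rest of the test.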

My hope is that this tool will be useful in the future for people to independently test their OpenSimulator installations. However, this does require me to seriously improve the documentation at the pCampbot OpenSimulator wiki page. Please feel free to tell me if that kind of thing would be useful – otherwise these things have a tendency to slip down the long priority list.

Next Time

In the next post, I plan to move onto some of the organizational aspects of putting on a conference in OpenSimulator and virtual worlds in general, such as grid management, region layout, planning committees, the people you need, etc. Stay tuned!

(Article reprinted with permission from Justin Clark-Casey’s OpenSimulator blog [9], where it appeared in two parts.)