Avination helps save OSgrid, donates cluster storage solution

OpenSim core developer Melanie Thielker is donating a new cluster storage solution to the community. Called FSAssets, it is designed to replace the MySQL databases or the RAID storage arrays used by other grids.

FSAssets is also the new asset storage for OSgrid, which had previously relied on RAID storage.

Melanie Thielker

Any grid with more than 80 regions that has a separate server for its assets database should upgrade, Thielker said — and the earlier, the better.

Especially if the grid is an open grid, allowing external connections, assets can add up quickly.

Converting a terabyte of assets from MySQL to FSAssets can take weeks, she said.

“It’s better to do it sooner,” she added.

OSgrid had a database of 3.5 terabytes of data, or about 21 million separate files.
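As a rough back-of-the-envelope check, those two figures imply an average asset size. The sketch below assumes decimal terabytes (10^12 bytes) and treats both numbers as the approximations the article gives them as; the variable names are illustrative only.

```python
# Figures quoted above; decimal units (1 TB = 10**12 bytes) are an assumption.
TOTAL_BYTES = 3.5e12
FILE_COUNT = 21_000_000

# Average size of one stored asset file.
average_bytes = TOTAL_BYTES / FILE_COUNT

print(f"{average_bytes / 1000:.0f} kB per asset on average")
```

Under those assumptions the average works out to roughly 167 kB per file.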

The problem with RAID

Last August, OSgrid went down after a failure of its RAID 10 storage, followed shortly afterwards by The Next Reality Grid. In October, Craft had its own issues with RAID as well.

Thielker had been helping out with OSgrid in the background, was familiar with how OSgrid was organized, and was a member of the OSgrid development chat group.

(Image courtesy SERT Data Recovery Services.)

“So when OSgrid had the big crash, I was one of the first to find out, and was in a position to ask about the details,” she said. “They had trusted in RAID, and the hard disk had gone bad on them. At that point, there wasn’t much that I could do except go and tell OSgrid how these kinds of things can be avoided in the future, and that they need to rethink their storage structure in view of the recent failure.

“It is locking the barn door after the horse got stolen,” she added. “But at least the next horse wouldn’t be stolen.”

Thielker is no stranger to large asset databases. Avination is one of the largest OpenSim grids, and five years ago it also ran into the limits of what the default MySQL database could do. However, instead of upgrading to RAID, she decided to take her grid one step further, to cluster technology.

The results paid off. Avination’s uptime percentage is the highest in OpenSim, she said, and higher than that of Second Life.

“It is a strong argument that I could make of going for something proven to work without breaking in a high-availability environment,” she said.

FSAssets versus RAID

Avination’s cluster storage solution is called FSAssets and starts out where RAID leaves off.

By default, OpenSim stores all of its assets in one giant file: the MySQL database itself.

When that got too difficult to manage, both OSgrid and Avination replaced that one big file with lots of smaller files — a separate file for each asset.
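The idea of one file per asset can be sketched in a few lines. This is only an illustration, not the OSgrid or FSAssets code: the article does not say how files are named or laid out, so naming each file after a SHA-256 hash of its content and spreading files over two levels of subdirectories are assumptions, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def asset_path(root: Path, data: bytes) -> Path:
    """Return the file path for an asset, derived from a hash of its content."""
    digest = hashlib.sha256(data).hexdigest()
    # Two levels of subdirectories keep any single directory from growing huge.
    return root / digest[:2] / digest[2:4] / digest


def store_asset(root: Path, data: bytes) -> Path:
    """Write one asset to its own file and return the path it was stored at."""
    path = asset_path(root, data)
    path.parent.mkdir(parents=True, exist_ok=True)
    if not path.exists():  # identical content is written only once
        path.write_bytes(data)
    return path


def load_asset(path: Path) -> bytes:
    """Read a stored asset back from its file."""
    return path.read_bytes()
```

With a layout like this, each asset can be copied, checked or restored on its own, which is much harder when everything lives inside a single database file.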

Then both put those files into RAID storage, which stands for Redundant Array of Independent (formerly Inexpensive) Disks.

RAID uses multiple hard drives to store duplicates of the data. The idea is that if one drive is busy, or broken, the other drive can step in.

“With that asset server, OSgrid managed to get over the hump that had been limiting its growth,” Thielker said.

At that time, Thielker said she could see the same issues coming down the road for the Avination grid, so she sat down and wrote her own asset server.

Her solution went one step further and added clusters.

“It is designed to grow by fragmentation,” she explained. “The minute you fill up a server, it splits into two, and when those two are filled up, it splits into four.”
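One common way to get this kind of doubling growth is to pick a server from a hash of the asset ID, modulo a power-of-two server count. The sketch below shows that technique under stated assumptions; the article does not say how FSAssets decides which server holds which asset, and `shard_for` and `split_targets` are hypothetical names.

```python
import hashlib


def shard_for(asset_id: str, shard_count: int) -> int:
    """Map an asset ID to a shard; shard_count is assumed to be a power of two."""
    if shard_count < 1 or shard_count & (shard_count - 1):
        raise ValueError("shard_count must be a power of two")
    digest = hashlib.sha256(asset_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the hash as an integer and reduce it.
    return int.from_bytes(digest[:8], "big") % shard_count


def split_targets(shard: int, old_count: int) -> tuple[int, int]:
    """Shards that can receive a shard's assets when the cluster doubles."""
    return shard, shard + old_count
```

The useful property is that doubling never shuffles data across the whole cluster: when two servers become four, every asset on server 0 either stays on server 0 or moves to server 2, and every asset on server 1 either stays or moves to server 3.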

This group of servers is the cluster, she explained.

“It appears as a single machine to the outside world,” she said. “It has one IP address, you send the request to the IP address, and you don’t know which member of the cluster actually answers. Inside the cluster, there is both redundancy and load-sharing.”
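A toy model can make the combination of a single entry point, load-sharing and redundancy concrete. It is a sketch, not Avination's implementation: it assumes every cluster member holds a full copy of the data, models each member as a plain function, and uses a hypothetical `Cluster` class that rotates the starting member on each request and falls through to the next one on failure.

```python
import itertools
from typing import Callable, Optional, Sequence

# A member is modeled as a function: asset ID in, bytes out (None if unknown).
# A member that is down raises ConnectionError.
Node = Callable[[str], Optional[bytes]]


class Cluster:
    """Single entry point in front of several interchangeable members."""

    def __init__(self, nodes: Sequence[Node]) -> None:
        if not nodes:
            raise ValueError("a cluster needs at least one member")
        self._nodes = list(nodes)
        self._start = itertools.cycle(range(len(self._nodes)))

    def get(self, asset_id: str) -> Optional[bytes]:
        start = next(self._start)  # load-sharing: rotate the first member tried
        for offset in range(len(self._nodes)):
            node = self._nodes[(start + offset) % len(self._nodes)]
            try:
                return node(asset_id)
            except ConnectionError:
                continue  # redundancy: try the next member instead
        raise ConnectionError("no cluster member answered")
```

The caller only ever talks to `Cluster.get` and cannot tell which member served the request, which mirrors the single-address behavior described in the quote.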

It turned out that her solution worked almost the same way as OSgrid’s. Replacing OSgrid’s code with the FSAssets code required her to change only 10 lines of code.

When it turned out that the FSAssets code easily integrated into OSgrid’s infrastructure, Thielker realized that it could be added to the standard distribution of OpenSim as well.

“And it will appear in core after a bit of cleanup,” she said.

Cleaning the data

The most time-consuming part of the whole process wasn’t rewriting the code but cleaning up the recovered asset data.

“It took three weeks of massaging the data to make it useful,” she said. “I was in a position to know how to go about this, since Avination has been around for five years now, and we had similar issues. Except our issues were always hidden from our users because we always had between two and eight copies of everything.”

Now OSgrid is in a similarly strong position, she said.

“When they lose a machine, nobody will notice, because the other one keeps running and provides full service to everyone until the first machine is fixed,” she said. “It is called high availability architecture and that is what banks, airlines and so forth use.”

Maria Korolov