OSgrid lost new content last month

OSgrid lost a chunk of new inventory items over the course of this past month, but the problem doesn’t affect other grids.

“It appears that for the last month or so, the asset server was unable to write assets to the database,” OSgrid administrator Allen Kerensky said in yesterday’s announcement.

The grid has not yet released numbers as to what percentage of the new content was lost, but some residents are reporting that up to 75 percent of their content was okay.

The fact that inventory items were being cached by viewers or individual regions hid this problem, he explained.

It only became apparent when the asset server was restarted for emergency patching on Thursday, April 16.

The grid has already taken steps to correct the problem, he added.

Entrance of Wright Plaza on OSgrid, the grid’s main offices.

“Several changes have been made to the asset cluster to attempt to prevent a repeat of this type of event, or detect and report it if it starts to happen again,” he said. “OSgrid has changed the database indexing system, added new logging and exception reporting, and additional database write reporting. Additional asset write-and-verify tests are being implemented going forward.”
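The announcement does not say how OSgrid's write-and-verify tests work. As a rough illustration of the general idea only, the sketch below stores an asset, reads it back, and compares content hashes before reporting success. The table layout, the function names, and the use of SQLite as a stand-in database are all assumptions made for this example, not details of FSAssets.

```python
import hashlib
import sqlite3
import uuid


def open_store():
    """Create an in-memory stand-in for an asset table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE assets ("
        "id TEXT PRIMARY KEY, sha256 TEXT NOT NULL, data BLOB NOT NULL)"
    )
    return conn


def write_and_verify(conn, data):
    """Store an asset, then read it back and compare hashes.

    Raises RuntimeError instead of returning an id if the stored copy
    is missing or differs from what was submitted.
    """
    asset_id = str(uuid.uuid4())
    digest = hashlib.sha256(data).hexdigest()
    with conn:  # commits on success, rolls back if the insert raises
        conn.execute(
            "INSERT INTO assets (id, sha256, data) VALUES (?, ?, ?)",
            (asset_id, digest, data),
        )
    row = conn.execute(
        "SELECT data FROM assets WHERE id = ?", (asset_id,)
    ).fetchone()
    if row is None or hashlib.sha256(row[0]).hexdigest() != digest:
        raise RuntimeError(f"asset {asset_id} failed write verification")
    return asset_id
```

The point of the read-back step is that a cache in front of the database cannot mask a failed write, which is the failure mode described in the announcement.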

As a result of the fixes, the database is now working correctly, he said.

Grid residents should now restart their regions and try to reload the assets, if they can, from their original sources, or try to recover them from IAR or OAR backups, he said.

“We feel your pain with these ‘missing asset’ messages, as the OSgrid admin inventories and assets were not immune and are experiencing the same issues you are,” Kerensky concluded. “We deeply apologize for any inventory or asset issues that you have encountered.”

OSgrid residents were mostly philosophical about the inventory losses. They’ve recently returned to the grid after a six-month-long outage, so they are prepared for difficulties.

However, some were unhappy about the way the grid communicated about this issue.

“I do have a real complaint about grid management not doing a good job about getting the word out on something like this,” said Danko Whitfield in a comment on Google Plus. “That is a much bigger problem. That is an area that other grids do much better at than OSgrid.”

For example, residents pointed to the title of yesterday’s announcement — “Asset cluster maintenance complete” — which seemed worded so as not to catch anyone’s attention.

Other grids not affected

OSgrid is running a unique asset management system, called FSAssets, which has not yet been released to the public, said Zetamex CEO Timothy Rogers.

“So this is an issue specifically for OSgrid,” he told Hypergrid Business.

He added that Zetamex uses a different system for the grids it hosts.

“Our version is much like our Zetamex Inventory,” he said. “It does daily self checks and is also backed up every 15 minutes and replicated six times.”

The Zetamex system is not the same as the default asset system that comes with OpenSim, but he says he hasn’t seen any asset losses happen there, either.

“The stock assets server is very strong,” he said. “It just gets slower as it gets bigger, which is why we use a custom one. Now I have seen some loss with the XAsset service which is experimental in OpenSim anyways. But stock is fine.”

Sudden shutdowns can also affect databases

Even if a grid’s database is running smoothly, it can still be adversely affected by things like sudden power outages and computer crashes.

These problems can affect even grids without a complex, large scale infrastructure, Avination grid founder and OpenSim core developer Melanie Thielker told Hypergrid Business.

“MySQL and MariaDB have both got weaknesses when the machine they are running on is shut down unexpectedly by a power cut,” she said.

Melanie Thielker

Thielker has been working with OSgrid on their new asset management infrastructure.

“For OSgrid, we added some options that make this less likely,” she said. “But it can still happen that table corruption occurs and a table is marked as crashed, thereby not allowing any access until it is repaired. Likewise, filesystems can get corrupted by power cuts as well.”

Overloaded systems can also damage data, she added, which can sometimes happen when servers are hit by denial-of-service attacks.

She recommends that grid owners have uninterruptible power supplies in areas where power outages are likely, and avoid “hard restarts” of their servers as much as possible.

“Finally, one should install monitoring and alerting tools so issues are noticed right away,” she said.

maria@hypergridbusiness.com

Maria Korolov

Maria Korolov is editor and publisher of Hypergrid Business. She has been a journalist for more than twenty years and has worked for the Chicago Tribune, Reuters, and Computerworld and has reported from over a dozen countries, including Russia and China.

  • Geir Nøklebye

    It is too early to clear the stock asset server in 0.8.1 entirely. It is probably OK in standalone mode, but for Hypergrid transfers it is not working entirely as it should with nested items (I can point exactly to the commits where the problem arose and still exists).

    I also experienced direct loss of textures when running 0.8.2.dev.d96d31b – 04-12-2015 Robust on my main grid for half an hour to test other functionality that was difficult to test in the test environment.

  • Robert Graf

    It is amazing to see folks who don’t learn from their mistakes. It’s all about communication and letting the chips fall where they may.

    We OSgrid residents are understanding. We are that by our very nature or we wouldn’t have returned to OSgrid. Most region owners used the downtime to experiment with other grids and in some cases running our own hyper connected grid. With search and region/grid promotion tools like opensimworld, etc. there is no need to be on any particular grid. Roll your own.

    And that is what the OSgrid admins risk. OSgrid becoming a metaverse backwater. An abandoned grid. Residents so turned off by the complete lack of CLEAR communication from OSgrid admins that they no longer bother logging in or running regions there. It all comes down to opening your mouths and telling people the raw, unadulterated, complete truth about what is happening.

    It’s not a maintenance issue, people… Call it what it is… A failure of the Asset Server which resulted in a month’s inventory loss. There now. Nice. Clean. Clear. Ripped that bandaid off in one shot. Maybe try that next time. ; )

    • Sammy Greenway

      People have to realize that they have to troubleshoot to find out what it is that is causing the breakage. They can take the time to communicate that they don’t know what is wrong, or they can take the time to try to figure out what is wrong.

      • Han Held

        There’s no reason they can’t do BOTH; firing off a tweet (“we don’t know what’s wrong, we’re investigating, watch this space!”) takes 5 minutes MAX.

        • I just wrote an article about this very topic for CSO magazine:

          http://www.csoonline.com/article/2912434/data-breach/report-it-managers-not-best-leaders-in-breach-crisis.html

          Booz Allen did a study on how companies respond to technology crises. Often, the technology folks wind up in charge of dealing with it. And because they’re technologists, they deal with the technology.

          Their attitude is, “Stop bothering me, I’m trying to fix this!” whenever anyone asks them anything.

          As opposed to thinking about the long-term interests of the company (or, in OSgrid’s case, the non-profit): maintaining customer confidence, maintaining lines of communication, restoring trust, etc.

          Unfortunately, when the entire organization is run by techies — as OSgrid is, since it’s a cutting-edge, developmental grid — not only aren’t there “people” people to step in and run things in time of crisis, but the techies might also not realize that those guys are even necessary.

          Booz Allen talks about another problem that happens when IT guys are in charge: they might make decisions that make sense technologically, but aren’t the best for the business.

          For example, in OSgrid’s case, they shut down the grid for six months while fixing everything. That might have made technical sense.

          But a manager with more of a business or people focus might have opted to keep the grid going without the assets, letting people reload IAR files and so on, while working on restoring the databases on the side. That way, six months later, if it turned out that it was impossible to restore the databases after all, the grid would have been most of the way back already, while if they DID manage to restore the databases, they could combine the two systems, run a de-duplication routine, and bring the old assets back. It would have taken extra time for the merger, but it would have kept the grid up and running meanwhile.
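To make the "combine the two systems, run a de-duplication routine" step concrete, here is a minimal sketch that merges two asset stores by content hash. It assumes each store can be represented as a plain mapping from asset ID to raw bytes, which is a simplification for illustration; `merge_assets` and its return shape are hypothetical and do not describe any OpenSim tool.

```python
import hashlib


def merge_assets(restored, interim):
    """Merge two {asset_id: bytes} stores, keeping one copy per distinct content.

    Returns (merged, aliases). `merged` holds the surviving assets and
    `aliases` maps each dropped duplicate ID to the ID that was kept, so
    inventory entries pointing at a dropped ID can be redirected.
    Assumption: IDs are unique across stores unless the content is identical.
    """
    merged = {}
    aliases = {}
    kept_by_hash = {}  # content hash -> asset ID that was kept
    for store in (restored, interim):  # restored assets take precedence
        for asset_id, data in store.items():
            digest = hashlib.sha256(data).hexdigest()
            if digest in kept_by_hash:
                if kept_by_hash[digest] != asset_id:
                    aliases[asset_id] = kept_by_hash[digest]
                continue
            if asset_id in merged:
                raise ValueError(f"ID {asset_id} maps to two different contents")
            kept_by_hash[digest] = asset_id
            merged[asset_id] = data
    return merged, aliases
```

For example, if an item re-uploaded from an IAR during the interim period has the same bytes as a restored asset, only the restored copy is kept and the interim ID is recorded as an alias.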

          • Justanotheravatar

            Can you point me to where OSgrid’s non-profit information is located? I’m not seeing it and I would think it would be displayed prominently.

          • That outlines one of my chief criticisms of Linden Lab, too, from a user standpoint (I’m not tech savvy). In their case, they need to let the geeks be geeks, and hire public relations people to communicate. They would also do better with people with backgrounds in physiology to design the parameters of avatars (they are woefully unrealistic and unnecessarily so; it’s all about what you let users edit and how). Additionally, they should hire people with a background in business so they don’t shoot themselves in the foot (e.g. by passing over the opportunity to buy out SLGo and bring in thousands of new users) or miss the fundamental concept that selling more sims at a slightly reduced price makes you more money than having a bunch of vacant sims no one will ever buy that actually end up costing you money by sitting there unused. But the irony is that since LL is comprised only of very skilled and specialized IT people, they aren’t even aware there is a problem. To a hammer everything looks like a nail. So maybe the ultimate answer is for someone with a business background to buy out Linden Lab, staff it appropriately, and start to act more strategically.

            Not just picking on LL; as you pointed out, it’s human nature. But it illustrates in the largest terms how this can become problematic.

  • Lani Global

    After OSGrid was rebooted, there were many clear cautionary warnings to users that things were not quite dialed in yet. That, combined with the importance of backing up your work with OARs and IARs, exports, XML, and image saving should be clear to all OpenSim users by now. Modern viewers have many ways to back up your work on your own computer. It is cutting edge Freedom with a capital F.

    • Samantha Atkins

      Lani, with all respect and much admiration, it just isn’t that hard to create a foolproof asset store and maintain it. There really is no good technical excuse. Telling people that if they lost stuff it is their fault doesn’t work well. Yes, we can do things to keep ourselves safer, granted. But this is major mismanagement of the grid. We should not pretend it is not. It is dishonest and will not address the problems.

      • Jim Jackson

        Hmmm are there any grids that absolutely guarantee that they will never lose your assets? Is that backed up by a monetary compensation or what? Maybe you could point me to the TOS of such a grid where it states this. Even the all powerful SL with more servers and bandwidth than all of opensim combined, has lost things of mine. I have been told since I created my first region to backup everything over and over. Not just on Osgrid … any grid. If I lose any assets I do blame myself and it does work well for me. Nobody has promised me anything or made any guarantees about anything on any grid. Have you been given such promises? If so, which grid was it? So where is the dishonesty? Do you have documents or other evidence of gross negligence and mismanagement? They consulted with the opensim devs and even had them help with restructuring the grid and asset servers. What else would you have them do? Did you make positive contributions to improve anything? Did you give them the hardware to have more backup? Did you write code for them to make alternate backups in a different mode? Did you start a fundraising effort for them to buy the best backup system in the metaverse? Am I missing something? Do they have unlimited resources that they are hiding? And uhm multibillion dollar companies with the best tech and backups still lose data and of course get hacked. Technology is not perfect even under the best circumstances, though some people like to think so.
        Samantha, you said you are done with OSgrid. Why do you continue to use your energy for negativity? How will that benefit you or anyone else? Please direct your know how and enthusiasm to assembling a physical infrastructure to host a grid the size of OSgrid. When your servers and bandwidth are in place, please open your grid to the public and make it free for anyone to connect their regions to. Make sure you only run experimental test code, none of those safe, boring well tested versions. When you get a consistent user base the size of OSgrid and maintain the grid with such precision and perfection that I am sure you will over a number of years, I will be the first to applaud you. I know since you are running it, not one prim or texture will ever be lost or corrupted. There will never be any downtime. There will never be any inventory issues or permission issues and never be any lag. Oh and please do all this on the same budget that OSgrid has done it and with no pay to any of the staff. I can’t wait! You are so awesome, you are going to have the best grid ever!

        • > What else would you have them do?

          Simply monitor the asset server to detect that the tables did not increase in size, and stop and investigate when people report issues.

          It should not take over a month to realize something was seriously wrong.
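The check suggested above can be sketched in a few lines. This is only an illustration under stated assumptions: `row_counts` stands for periodic samples of the asset table's row count and `uploads_seen` for the number of upload requests the server handled in the same window. Both inputs and the function name are hypothetical, and how they would be collected on a real grid is not covered here.

```python
def asset_table_stalled(row_counts, uploads_seen):
    """Return True if uploads were handled but the asset table did not grow.

    row_counts: row-count samples for the asset table, oldest first.
    uploads_seen: upload requests handled during the sampling window.
    """
    if len(row_counts) < 2:
        raise ValueError("need at least two row-count samples")
    grew = row_counts[-1] > row_counts[0]
    return uploads_seen > 0 and not grew


# Example: 25 uploads handled, yet the table stayed at 1,000 rows.
if asset_table_stalled([1000, 1000, 1000], uploads_seen=25):
    print("ALERT: asset table is not growing despite uploads")
```

A check this simple only catches a complete stall. Detecting a partial failure would need a comparison between the number of rows written and the number of uploads acknowledged.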

          • Jim Jackson

            Nobody is stopping you from creating a grid that size in your spare time and running it better. I look forward to the results. Best of luck.

          • The bigger the grid, the faster it should be evident that the asset table is not growing, but ok 🙂
            I’ll report back to you when I have passed 10k regions.

          • Jim Jackson

            Just to make it fair, make sure you have over 100,000 registered users, some with hundreds of thousands of items in their inventory. All the missing objects were usable up to the time it was reported on April 16th, at least in my inventory. So it’s not like they were told a month ago that there were missing objects. The article does say that up to 75% of objects were OK; that would seem to indicate that some assets, a sizable amount, were being written to the database.

    • To run the asset server for over a month and not discover that data isn’t being saved is not quite the ticket… Any DBA worth a grain of salt should have spotted it within a few hours.

  • Fly Man

    @LaniGlobal When OSGrid was rebooted, someone should have put down a large sign at every place that read “Everything you upload/create might go missing in time.” That way they could always point to the sign when someone breaks down (yet again).

  • Samantha Atkins

    Very sad. So no real lessons about data infrastructure were learned and implemented from all that huge downtime. I am done with osgrid. It isn’t worth my time and energy.

    • Sammy Greenway

      They did learn a lesson: they doubled up on servers and implemented a new data infrastructure. There was a glitch/bug that they did not know was there until they rebooted. As most of us on OSGrid who have been here for years (me since 2008) know, there are ups and downs when living on the bleeding-edge development grid. If you do backups like you should, this shouldn’t be an issue, but an inconvenience.

      • You can double the hardware and mirror as much as you want, but if the software writes zero-transactions you still get zero on your shiny new hardware and mirrors. I have seen entire banking systems going down on the belief that doubling hardware will save them.

        So, no, they did unfortunately not learn much.

        The key is having full transaction integrity in your critical writes to the database, and transactions involving assets are the most critical transactions in OpenSim. Without assets, the system is worthless.

        How do you get full transaction integrity? By rewriting your software, using enterprise-class database systems that fully support transactions and can back them out when something goes wrong, and finally having database administrators monitor the databases.

        There is also the human factor – developers more often than not think “their” software is perfect and that users complaining are wrong, nagging or “inexperienced”. The thing every software developer will learn over time is that their software will fail, and it will fail in unexpected ways. In addition, users will find ways of using the software you did not plan for. So listen to your users.
        When they say that assets are not sticking, or disappearing or rezzing is failing, investigate and investigate fast. EVEN if you have to stop your system to debug or fix it. Otherwise the damage to your users over time can be such they leave you and never look back.
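As a small illustration of what "backing out" a transaction means, the sketch below writes an inventory entry and its asset in a single transaction, so that a failure on either write leaves neither behind. SQLite is used here purely as a self-contained stand-in for whatever database a grid runs, and the two-table schema and function names are assumptions made for the example, not OpenSim's actual schema.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical two-table layout."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE assets (id TEXT PRIMARY KEY, data BLOB NOT NULL)")
    conn.execute(
        "CREATE TABLE inventory (item TEXT PRIMARY KEY, asset_id TEXT NOT NULL)"
    )
    return conn


def add_item(conn, item, asset_id, data):
    """Write the inventory entry and its asset together, or neither."""
    with conn:  # one transaction: commit on success, roll back on any exception
        conn.execute(
            "INSERT INTO inventory (item, asset_id) VALUES (?, ?)",
            (item, asset_id),
        )
        conn.execute(
            "INSERT INTO assets (id, data) VALUES (?, ?)", (asset_id, data)
        )


conn = make_db()
add_item(conn, "chair", "a1", b"mesh-bytes")
try:
    # The asset write fails (NULL data), so the inventory row is rolled back too.
    add_item(conn, "table", "a2", None)
except sqlite3.IntegrityError as exc:
    print("write rejected and rolled back:", exc)
```

The failed call raises an error to the caller rather than leaving an inventory entry that points at an asset that was never stored.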

      • Robert Graf

        If the Asset Server is corrupting data, wouldn’t your backups (OARs and IARs) also be corrupted? So, if that’s the case, then backing up does you no good. ; )

  • Penny Lavie

    System administration is an engineering discipline, and as such shares many principles with science. To quote Carl Sagan: ‘Science is more than a body of knowledge. It is a way of thinking.’ Clearly something is amiss within OpenSim culture.
    Community silence does not signal acceptance; rather, many professional administrators have witnessed with horror and dismay OSG’s attempts to manage the system. The OSG rank and file, if not careful, will be permanently labelled groupies.

  • Lost in Space

    If this was a true open source project they would turn testing over to several grids like Metropolis/Kitely/3rd Rock

    Now use the expenses from before to hire developers, and also get with a 3rd party who can help plan a road map while creating a leader system so we all have someone to rally around.
    I want to see our own version of Ebbe Linden in Open Sim that people rally around who has good leadership skills while bringing open sim back in the news for all the right reasons.
    a multi-grid unified marketing campaign for open sim —- lets rebrand open sim with a new image with one website that explains everything anyone wants to know/knowledge base/facts/downloads/open sim forum all with its own stand alone region..
    people could get the major releases while trying forks like whitecore also filled with oars/terrain files… everything a new person to open sim needs even a built in link to kitely marketplace and integrated designation guide all with sponsors and volunteers to support the ” OPEN SIM HUB- the centerpoint of the Metaverse”

    why? oh why? can this not happen..seems another asset loss is the least worries in the days ahead.

  • Jim Jackson

    First, if you think there is a communication problem, ask yourself if you are doing your best to communicate. If you notice a problem, report it to an admin, post it on the forum, ask for help on the IRC. Don’t rely on one source of information or help on any grid. It helps them and they will help you and determine if it’s a grid problem or something inherent to the latest opensim release. This works for OSgrid, it works for other grids. I have been treated kindly and with respect by admins on every grid I have been on. Read through the Opensim Mantis and see if your problem is there already and being worked on. Communicate with people who might have more technical knowledge than you, they are probably already aware of your problem. Talk with other region owners to get their input and see if they are experiencing similar problems. Your communication skills and the frequency of your communication will only help you be informed.
    Secondly, backup anything that has the slightest importance to you. Make backups as many ways as you can in whatever format you can. I have lost items in my inventory from other grids. If you think this has only happened on OSGrid, you would be wrong. In one instance my inventory was triplicated making it a complete mess.
    In this case, OARs did work to restore missing items. Any missing items that were rezzed on my regions were restored when loading from OARs. Although they were not in the Osgrid database, they were in my database. Loading the oar reloaded the item to the OSgrid database making it a viable object. As for any other missing objects I created, they are all stored on my computer.
    Other grids, both commercial and free, already have access to the Opensim code and bend it, tweak it, and make their own versions. Some quite successfully and share the improvements.

    No grid is perfect, but I am not going to publicly denigrate or criticize any free open grid. It serves no constructive purpose. They make decisions based on what they feel is best and that is their right. I don’t always agree with the direction a grid is headed, but it is not my grid. Nobody forces me to remain on a grid, I am free to go to another or create my own. I have had regions on several free grids and appreciate them immensely for allowing me to do so.
    If you feel so strongly about what a grid should be, or what Opensim should be, direct your energy towards something constructive. Take the Opensim code, or develop your own from scratch, assemble a team of developers and administrators and make it just the way you want it. Then you can release your test versions or whatever version you want to whomever you want. What is the point of telling someone else how you think they should do things when you have it within your power to do it the way you want it done?

  • Jim Jackson

    You have the knowledge and expertise. Assemble your team and hardware and run with it. I am sure your results will be spectacular.

  • Susannah Avonside

    The OSGrid asset server is still as bad as it ever was before it crashed. It still takes forever and a day for inventory to load, and frequently stalls. Both the slow loading and the stalling make it impossible to rezz items from inventory, which is frustrating to say the least. People want to get inworld and then start doing things with as little delay as possible. I appreciate that a large inventory (mine’s 45k+ items) will take a short while to load, but surely 10 minutes plus is somewhat excessive. I have found that it is often better to Hypergrid to another, European-based grid in order to get my OSGrid avie’s inventory to load. Surely this is ridiculous?

    I used to make small monthly financial contributions to OSGrid, but after it had been offline for four months I cancelled those, due to the atrocious approach to keeping the user base informed. My donations now go to the small European based grid I sometimes spend time on – I have a region there. I’ve noticed that now OSGrid are urgently seeking people to donate to them so that they can continue to run. Perhaps they should have thought of this aspect when it was decided that the user base could be kept in the dark. When OSGrid came back online there was initially a flurry of interest, and the numbers inworld appeared to increase, but after the second and continuing fiasco with the asset server, the average numbers of inworld users seems to have halved the previous, pre-crash averages. At that time, there seemed to be an average of 80-90 users inworld at any given time, nowadays it seems to average around 50 or so.

    I used to be enthusiastic about OSGrid, but I am way less so now. I hardly start up the regions I have configured for the grid, and I am seriously considering just running my own standalone regions. I know that this means that I have to manage it all, but at least I know that I won’t have to wait 10 minutes for my inventory NOT to load. It was bad before the crash, and there has been no improvement since it came back online.

    Does OSGrid have a future? There seem to be a lot of disgruntled (ex) residents.

    • I think part of the problem is they have tried to architect a CDN (content delivery network) without actually having a CDN. In the original setup (also Linden Lab’s) the content was delivered from and cached by the region server, meaning it was distributed over a large number of systems. That is also the case in most small OpenSim grids, where the region server most likely is close to you in addition.

      By serving all up from one central point via http as on OSgrid, you both get the bottleneck of the single asset server itself, and then also the bottleneck of the network pipe between it and you.

      • Susannah Avonside

        Hi Geir. I had thought it was something like that, as HG teleporting to a European based grid with an excellent network connection in order to get an OSGrid avatar’s inventory to load at all, let alone at a half decent rate seemed a bit weird. My inventory takes a while to load in Second Life as it’s almost as big as my OSGrid inventory, but it loads quite quickly by comparison. I know there is a lot of internet routing between me in Wales and the OSGrid asset server in Texas, and that at times there will be congestion, but as OSGrid is decentralised in terms of region connections, could not the asset servers also be likewise decentralised? It would certainly ease the lot of many of us in Europe.

        *I know we could all go and join Metropolis, or one of the many other European based grids, but personally I find the German grids somewhat laggy, so much so I’m beginning to suspect that it might be a fetish thing 🙂 (I also have a presence on the Speculoos grid, based in Belgium (but servers apparently in Germany), but most of the time my presence seems to be the only one there 🙁 )

        • If I remember the presentation correctly, I believe Inworldz have done the grunt work with their nosql asset server approach to support a distributed asset server.

          I think that the reason why it loads when you teleport elsewhere is that it backfills the assets from the cache of the local grid server rather than pulling them from OSG. Inventory records would have to come from OSG anyway, but those are small comparatively and transfer fast.

    • Han Held

      I’d encourage you to run your own regions on a standalone; you’ve already done most of the hard part setting up your computer to run OSGrid’s distro. From here you simply have to back up your inventory into an iar file, save your regions to oar files, create a hostname on no-ip or similar, set up a local copy of the diva distro and from there restore your backups. It sounds a lot harder than it really is. (esp once you’ve done the whole thing a few times).

      By running your own standalone you get the full power of opensim, which is the ability for each of us to run OUR OWN GRIDS, controlling everything from top to bottom (inventory, policies, etc). It might seem intimidating now, but once you get going you’ll never want to give it up! 🙂

      • Susannah Avonside

        Hi Han. I have been running my own standalones for a while, starting as soon as I got a decent (i.e. fibre-optic) internet connection some two years ago, but persisted with my regions on OSGrid until last year’s crash. Since OSGrid has been back up I haven’t been that enthusiastic about my regions there, and only run them sporadically. I mostly run my standalones, mainly diva distro vars.

        OSGrid has probably done themselves, and by association the whole OpenSim project, a huge disservice by the way they have treated people in this whole fiasco – nothing seems to have been learned. It’s one thing to piss off n00bs visiting from Second Life who will soon usually vocally complain about the glitches we all take as part and parcel of being on an experimental grid, but when even seasoned users of OSGrid have taken to wearing tags that state ‘Fix the Asset Server’ it’s time someone woke up and smelled the coffee.

        A huge amount could be learned from InWorldz, hardly my favourite grid by a long way, partially because it isn’t HG enabled. In the early days teleporting there was very much luck of the draw and the bendy leg issue was a big one – and even being able to move after a teleport was a bit of a gamble. To be blunt, I was quite surprised to see that grid grow and attract more and more users, most of whom seem remarkably loyal to a grid that was, in its early days, a mess, to say the least. But InWorldz has always scored on the sense of community, and a community centered on InWorldz, with access, not just to admins, but the actual grid owners. OSGrid has a strong community, but that isn’t centered so much on OSGrid, but more on OpenSim in the widest sense. I think there was a fairly strong and developing sense of an OSGrid community, but all that seems to have gone since the PR fiasco post asset server crash. I am on SL quite often, and I frequently see people there now who before the crash had more or less given up on the Linden Grid.

        • I think you are right that quite a few SL accounts have gotten a renaissance after the last data loss on OSG, which was totally unnecessary.

          There are really two things to it:

          1. Proceeding without backup (and mirroring on flimsy disk systems is not backup)

          2. OpenSim 0.8.1 should never have been released in the state it was with major underlying changes to inventory and HG asset transfer virtually untested (I was one of the few who really tested it.) It has created loads of issues around the grids including OSG.

    • Magnuz Binder

      OSgrid average concurrency had been slowly recovering from 53 users logged in at 2013-05-24 to 83 at 2014-08-17. Then the half year outage happened from 2014-08-18 to 2015-02-24. After this, concurrency increased to 90 at 2015-03-15 – 2015-03-28, but has since decreased by 52%, to 43 at 2015-06-07, with 8 down just the last week from 51 at 2015-05-31.

      All concurrency numbers above are trends over the previous 4 weeks extrapolated to the actual date, to exclude short and random fluctuations. Otherwise the drop could have been 60%, from 92 daily average at 2015-03-25 to 37 daily average at 2015-06-05.

      Give it another month and a half with the present rate of decline, and all the OSgrid problems with inventory, asset storage, backups, server costs, needs for donations and management will have solved themselves.
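The percentage declines quoted above can be reproduced directly from the concurrency figures given in this comment. The short sketch below only redoes that arithmetic; the helper name `percent_drop` is introduced here for illustration.

```python
def percent_drop(before, after):
    """Percentage decline from `before` to `after`."""
    return (before - after) / before * 100


# Four-week trend values: 90 users in late March 2015, 43 on 2015-06-07.
trend_drop = percent_drop(90, 43)

# Raw daily averages: 92 on 2015-03-25, 37 on 2015-06-05.
daily_drop = percent_drop(92, 37)

print(f"trend decline: {trend_drop:.0f}%")  # prints "trend decline: 52%"
print(f"daily decline: {daily_drop:.0f}%")  # prints "daily decline: 60%"
```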

      • Susannah Avonside

        Indeed, and barring something akin to a miracle, I suspect that is what will happen.