I’ve run into multiple performance issues with the kiwiweather.com web server, and at the heart of them was that I’d tried to run everything on a single, 10-year-old PC running Proxmox and multiple VMs. This was very much taking the low-cost option and seeing just how much I could do with it.
The main PC specs are:
- Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz – 4 cores / 8 threads
- 24GB RAM (DDR3)
- Samsung SSD 840 Series – 120GB for Proxmox
- Seagate BarraCuda 2TB – for the VMs
At first, the performance was more than good enough, which lulled me into a false sense of security that the basic solution was working well. But of course over time I added more and more functionality, especially the number of satellites I was capturing and the amount of image processing. Creating multiple videos an hour was especially significant.
Over time I saw the average time to load the home page go from well under a second to often 10+ seconds. This was simply not good enough for anyone to make use of the site.
So the question I needed to understand was “why?”.
I’d previously found that the CPU thermal paste was well past its best, so replacing that helped drop the CPU core temperatures significantly. But that only clawed back some of the performance.
I also found that every time I improved the page load times, more people would use the site until it got to the point where pages were again taking too long to load.
I went down what turned out to be a false path when I spotted that images are copied to the web server from multiple sources and then need to end up in the correct location. I was originally copying them into place, so moving them instead seemed an obvious speed-up. That step was significantly faster, as there was a massive decrease in the number of IO operations the hard disk had to do.
But that didn’t help at all in the end, despite the obvious step forward for IO on the web server at that step. Instead I’d made network performance the bottleneck, because each of the feeder VMs / Raspberry Pis had to rsync more data. Ironically, this also increased the IO operations, as more data needed to be written to the HDD. One step forward, two steps back. So I backed out the moving of the files and switched to migrating them with rsync (an amazing tool!).
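To make the "more data to rsync" effect concrete, here is a minimal sketch in Python. It rests on an assumption the post does not spell out: that moving files out of the directory rsync delivers to makes them look missing on the next run, so they get sent again. The file names and the `simulate` helper are hypothetical, and real rsync compares sizes and timestamps rather than just names.

```python
def files_to_transfer(source: set, destination: set) -> set:
    """Files a simple sync must send: present at the source, missing at the destination."""
    return source - destination


def simulate(cycles: int, move_after_sync: bool) -> int:
    """Count file transfers over several sync cycles, with one new image per cycle."""
    source = set()
    destination = set()
    transferred = 0
    for cycle in range(cycles):
        source.add(f"image_{cycle}.jpg")  # hypothetical feeder captures a new image
        pending = files_to_transfer(source, destination)
        transferred += len(pending)
        destination |= pending
        if move_after_sync:
            # Assumption: the web server moves files out of the landing
            # directory, so the next sync no longer finds them there.
            destination.clear()
    return transferred


if __name__ == "__main__":
    print("copy:", simulate(5, move_after_sync=False))
    print("move:", simulate(5, move_after_sync=True))
```

Under these assumptions, the copy approach sends each image once, while the move approach resends every image still held on the feeder on each cycle.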
Then as usual, the performance improved, more people accessed the site and I was soon back to where I was.
Clearly IO performance was a key bottleneck I needed to resolve, so I changed how often I created the videos. Instead of rebuilding whenever I got a new image used in any of the videos, I moved their generation to a schedule. Each video was rebuilt once an hour at first, and later once every three hours.
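The change from "rebuild on every new image" to "rebuild on a schedule" can be sketched as a simple time check. This is only an illustration in Python, not the scripts actually used on the site; `should_rebuild` and its parameters are hypothetical names.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_rebuild(last_built: Optional[datetime], now: datetime,
                   interval: timedelta) -> bool:
    """Return True when a video is due for a rebuild.

    A video that has never been built is always due; otherwise it is due
    once at least `interval` has passed since the last build.
    """
    if last_built is None:
        return True
    return now - last_built >= interval


if __name__ == "__main__":
    three_hours = timedelta(hours=3)
    last = datetime(2024, 1, 1, 9, 0)
    # New images arriving at 10:00 and 11:30 do not trigger a rebuild...
    print(should_rebuild(last, datetime(2024, 1, 1, 10, 0), three_hours))
    print(should_rebuild(last, datetime(2024, 1, 1, 11, 30), three_hours))
    # ...but by 12:00 the three-hour interval has elapsed.
    print(should_rebuild(last, datetime(2024, 1, 1, 12, 0), three_hours))
```

The point of the design is that the number of rebuilds is bounded by the clock rather than by how many images arrive, which caps the IO the video work can generate.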
Then as usual, the performance improved, more people accessed the site and I was soon back to where I was. This seems to repeat all too often!
Then I thought that it might not just be IO, but also the CPU and RAM assigned to each VM running the solution. A big clue here was that if I rebooted the whole server, it could take 20 minutes for all the VMs to get back up and running. I tuned this by giving the start of each VM an offset from the others, but that only got me so far.
To help with this, I found another old PC. I’d identified that the most IO-heavy part of the solution was the generation of the video files, so I set the PC up to run Proxmox and created a new VM to do the video creation, as well as act as the landing pad for all the geostationary images I obtain / capture. With this work moved to the “new” PC, I could reallocate the now-spare CPU cores and memory across the VMs on the original server.
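One way to stagger VM starts is Proxmox's per-VM `startup` option, which takes a start order and an `up` delay in seconds before the next VM is started. The sketch below only generates the `qm set` commands. It is an assumption that the offsets were configured this way, and the VM IDs and delay are made up for illustration.

```python
def startup_commands(vm_ids: list, delay_seconds: int) -> list:
    """Build one `qm set` command per VM, ordered as given.

    Each VM gets a start order (1, 2, 3, ...) and an `up` delay, which is
    how long Proxmox waits after starting it before starting the next one.
    """
    return [
        f"qm set {vmid} --startup order={order},up={delay_seconds}"
        for order, vmid in enumerate(vm_ids, start=1)
    ]


if __name__ == "__main__":
    # Hypothetical VM IDs; start them one after another, 60 seconds apart.
    for command in startup_commands([100, 101, 102], delay_seconds=60):
        print(command)
```

Staggering like this spreads the boot-time disk reads out over time instead of having every VM hit the same HDD at once.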
This seemed to be a significant step forward as I’d:
- Separated the two IO-heavy operations, the web server and the video creation work, onto two different PCs
- Boosted the CPU and RAM allocations to these two, plus the other VMs
And this time, performance was significantly improved and, so far, it has stayed that way.
The other improvement came as a result of running out of disk space on the original HDD used for VMs. The obvious solution was to simply add another HDD, but the PSU only had Molex connectors, which is a problem as modern HDDs use SATA power connectors. Luckily I was able to find an adaptor which went from Molex to SATA.
And surprisingly, moving the web server VM to run off its own HDD also made a difference.
So after all that work, I’d taken the page load time from 5-10 seconds to about 0.02 seconds, which is a 250-500 fold improvement. This is more than good enough for now.
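The 250-500 fold figure is just the old load time divided by the new one. A quick check in Python, using only the numbers quoted above:

```python
def speedup(before_seconds: float, after_seconds: float) -> float:
    """How many times faster the page loads: old load time over new load time."""
    return before_seconds / after_seconds


if __name__ == "__main__":
    after = 0.02  # approximate load time after the changes, in seconds
    for before in (5, 10):  # load times before the changes, in seconds
        print(f"{before}s -> {after}s: about {round(speedup(before, after))}x faster")
```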
And doing stuff like this is fun as you can’t just Google the answers for the exact solution you have!