Finite Computational Resources

Aug. 2, 2024 • 3 minute read
Django · Ansible · Caching · AWS · Linode · Performance

Ever since I started my programming journey, most of the code I have written has run on my own computer. Sure, it ran in different environments and on different operating systems, but in general it was my own little black box that I had full control over.

When I started pushing this website to a server, things were fine at first. The initial machine was a 2-core, 2 GB instance on Linode, which was reasonable while I was chasing the cheapest hosting and still had promotional credits. Even though it was much smaller than my development system, I wasn't running into many issues.

That machine hosted my entire stack as well: the main web server, the database and the other supporting containers, without really giving me problems with speed or responsiveness.

Switching to the AWS free tier flipped this completely on its head, surfacing quite a few pain points because the system memory got cut in half.

EC2 instances don't come with swap memory set up; that has to be done manually. Until I found the issue, the missing swap would completely cripple responsiveness and cause system freezes.

The CI/CD setup had problems too, as I had previously been building my Docker images on the target system rather than remotely, and the build process would run out of memory and grind to a halt. Flipping this around, building on my development machine and transferring the images to the target server with my Ansible deployment playbook, fixed the problem.

When all the containers were running, memory was very nearly capped out. Trimming my supporting containers and moving the database to a free RDS instance instead freed up around 20% of total memory usage. Curbing the number of web workers also helped a little, and allowed for more sensible CPU usage as well.
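
As a rough sketch of what that trimming can look like in config terms, assuming gunicorn as the WSGI server and PostgreSQL behind the RDS instance (neither is stated above, and the environment variable names are hypothetical):

```python
# gunicorn.conf.py - sketch for a 1 vCPU / 1 GB instance.
# Fewer workers means less resident memory and less fighting
# over the single burstable vCPU.
workers = 2                 # down from the usual (2 * cores) + 1
max_requests = 500          # recycle workers so slow leaks can't pile up
max_requests_jitter = 50    # stagger the recycling across workers
```

```python
# settings.py - point Django at the external RDS instance so the
# database no longer competes for memory on the web server.
import os

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": os.environ["RDS_DB_NAME"],       # hypothetical variable names
        "USER": os.environ["RDS_USERNAME"],
        "PASSWORD": os.environ["RDS_PASSWORD"],
        "HOST": os.environ["RDS_HOSTNAME"],
        "PORT": os.environ.get("RDS_PORT", "5432"),
        "CONN_MAX_AGE": 60,                      # reuse connections between requests
    }
}
```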

Unexpectedly, increasing caching behaviour gave the biggest performance upgrade, even though the reduction in memory would point in the opposite direction. After some more digging I found that burstable EC2 instances have a concept of CPU credits, which limits how much computational throughput an instance can deliver.
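
On the Django side this is just the built-in cache framework. A minimal sketch of what "more caching" can mean, where the backend, timeouts and view names are my assumptions rather than anything stated above:

```python
# settings.py - a small local-memory cache: it costs some RAM but saves
# the CPU time of regenerating each page on every request.
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.locmem.LocMemCache",
        "TIMEOUT": 60 * 15,               # keep rendered output for 15 minutes
        "OPTIONS": {"MAX_ENTRIES": 500},
    }
}

# urls.py - wrap mostly-static views in per-URL caching.
from django.urls import path
from django.views.decorators.cache import cache_page
from blog import views  # hypothetical app and view names

urlpatterns = [
    path("posts/<slug:slug>/", cache_page(60 * 15)(views.post_detail)),
]
```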

The equation for CPU credit usage from the EC2 documentation is:

\(1 \text{ CPU credit} = 1 \text{ vCPU} \cdot 100\% \text{ utilization} \cdot 1 \text{ minute}\)
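
As a worked example, pegging a single vCPU at full utilization for an hour burns \(1 \cdot 100\% \cdot 60 = 60\) credits, which a small burstable instance can't keep up for long.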

Caching saves all of this I/O plus computation time, since each page doesn't need to be regenerated and therefore uses less CPU time, while the instance also earns some of those credits back over time by idling:

\(\text{Credits earned per hour} = \% \text{baseline utilization} \cdot \text{number of vCPUs} \cdot 60 \text{ minutes}\)
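
If the instance is, say, the free tier's t2.micro with one vCPU and a 10% baseline, that works out to \(10\% \cdot 1 \cdot 60 = 6\) credits earned per hour, so every cached response leaves more of that small budget for the requests that actually need rendering.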

After these optimisations the responsiveness is in a reasonable state 99.9% of the time. In hindsight, these kinds of issues will always occur, and they are easiest to observe with minimal resources.