Infrastructure
A cloud for a cool person (me)
January 2021 - Now
Source:
This project represents my unified efforts to manage all my software configurations and deployments across all my machines.
Principles
In designing and building my homelab, I try to adhere to several core principles as much as I possibly can:
- Infrastructure as Code (IaC). If a configuration can be expressed as repeatable, reproducible code, it probably should be.
- Automation. If I keep doing the same thing over and over again, I should probably automate it.
- Emphasize security. Getting hacked is scary and has real-world consequences. So, I want to avoid getting hacked as much as possible by taking a mildly paranoid approach to security.
- Use standard tools as much as possible. This lab is here for me to learn how people in the industry do things. Just as dev should mirror prod, my attempts here should mirror actual corporate environments.
Current Infrastructure Setup
I’m currently at college, but I’ve brought my homelab with me. This means my homelab spans two physical sites!
Networking
DNS
I own many domain names, and they can become somewhat of a burden to maintain. As such, I have them all centrally managed through Cloudflare. Furthermore, to improve auditability and adhere to IaC best practices, I manage these domains through Terraform. They are automatically deployed by Terraform Cloud on every push to the monorepo.
Cascaded Routers
This is a setup I use whenever I have roommates. We have a router for the rest of the house, and I have a router specifically for my room. Just like Texas with ERCOT!
The reason I do this is so that I don’t accidentally break the rest of the LAN with my shenanigans. In other words, I expect that I’ll end up like Texas, but I’m trying to prevent the problems from reaching everyone else.
Now, this does make setup slightly more of a hassle, especially with the problem of port forwarding, but that’s acceptable.
List of sites
- Site 0: This is my home in the Bay Area. Much of the equipment has been migrated to SLO, so not much remains here. It has a cascaded router setup.
- Site 1: This is my apartment in San Luis Obispo (SLO) where I go to school. It also has a cascaded router setup.
Personal Computers
BANANA
This is my usually-stay-at-home laptop with the following specs:
- Hostname: banana.id.astrid.tech
- Location: Wherever I go, but usually Site 1
- Model: Lenovo Legion Y530-15ICH-1060
- OS: Dual Boot: NixOS, Windows 10
- CPU: Intel i5-8300H (4 cores, 8 threads)
- RAM: 32GiB
- GPU: NVIDIA GeForce GTX 1060 Mobile
- Monitors: 1920x1080 built-in + 2560x1440, 3840x2160
- Storage: 1TB NVMe SSD + 500GB SATA SSD
- Windows is installed on the 1TB NVMe
- NixOS is installed on an encrypted ZFS pool on the 1TB NVMe
- /home is on an encrypted ZFS pool on the 500GB SSD
- VM Guests
- winana: A Windows 10 VM for when I’m too darn lazy to reboot into full Windows 10.
- parrot-htb: A Parrot OS VM for working on HackTheBox puzzles.
I do not intend to upgrade this to Windows 11.
Cracktop
Cracktop is my travel laptop that I bring to and from school. It was my old laptop from high school.
- Hostname: cracktop-pc.id.astrid.tech
- Location: Wherever I go
- Model: HP Pavilion 13 x360
- OS: NixOS Unstable
- CPU: Intel i5-6300U (2 cores, 4 threads)
- RAM: 8GiB
- Monitors: 1920x1080
- Storage: 128GB M.2 SATA SSD
There are a couple of reasons why I use it despite its cracked screen:
- It’s a lot lighter than BANANA, which reduces the load in my backpack.
- Campus has a bike theft problem, so I wouldn’t be surprised if it had a device theft problem as well. If I lose this machine, I won’t be too sad, and with the cracked screen, no one would want to steal it.
Workload Servers
Bongus
This server was an absolute steal I got off of eBay for $200.
- Hostname: bongus-hv.id.astrid.tech
- Location: Site 1
- Model: HP ProLiant DL380P Gen8
- OS: NixOS Unstable
- CPU: 2x Intel Xeon (2x8 phys. core, 2x16 virt. core)
- RAM: 96GiB
- Storage: 128GB SATA SSD + RAID1 2x 1TB Used HDD
- NixOS is installed on an encrypted ZFS pool on the 128GB SSD
- The RAID1 HDDs are used for ZFS datasets containing miscellaneous data
Unfortunately, it eats a lot of power, so I only turn it on sporadically, when I need to heat my room.
Dedicated Devices
Entire bare-metal devices dedicated to a single purpose. These are usually low-power single-board computers (SBCs) that are always on, but have a workload that essentially requires the entire device.
APRS I-Gate
A dedicated device for decoding and uploading APRS signals around Site 0. Created because there is poor APRS coverage where I live.
- Location: Site 0
- Model: Orange Pi One
- OS: Armbian
- CPU: Allwinner H3 (4 core)
- RAM: 512MB
- Peripherals
- RTL-SDR (USB 2.0)
Jump Servers
Jump servers are Raspberry Pis with an SSH server exposed to the Internet. These are used to send Wake-on-LAN packets to other machines. Both Sites 0 and 1 are equipped with one of these:
- jonathan-js: Raspberry Pi 3 at Site 1
- joseph-js: Raspberry Pi 3 at Site 0
However, I intend to set up a WireGuard VPN so I don't have to expose them to the Internet; at that point, they can be relegated to a Wake-on-LAN-plus-other-stuff role.
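For reference, the wake-up itself is simple enough to express as a one-task Ansible play like the sketch below. This is just an illustration of the idea, not my actual setup: the MAC address and broadcast address are placeholders, and it assumes Ansible's `community.general` collection is available. In practice, running `wakeonlan` over SSH on the jump server accomplishes the same thing.

```yaml
# wake-banana.yml -- illustrative only; MAC and broadcast addresses are placeholders
- name: Wake BANANA via the Site 1 jump server
  hosts: jonathan-js
  gather_facts: false
  tasks:
    - name: Send a Wake-on-LAN magic packet
      community.general.wakeonlan:
        mac: "AA:BB:CC:DD:EE:FF"
        broadcast: 192.168.1.255 # hypothetical LAN broadcast address
```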
Public Cloud
I have two Always Free VPSes in Oracle Cloud. Only one is in use right now; the other will eventually become my WireGuard VPN lighthouse.
oci1
I run the astrid.tech backend on this one. It is managed through Docker Compose, and refreshed hourly using an Ansible playbook run from GitHub Actions.
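As a rough sketch of what that looks like (the file paths, playbook name, and secret name are placeholders, not my actual workflow), the hourly refresh boils down to a scheduled GitHub Actions job that runs `ansible-playbook` against oci1:

```yaml
# .github/workflows/refresh-oci1.yml -- illustrative sketch; names and paths are made up
name: Refresh oci1
on:
  push:
    branches: [main]
  schedule:
    - cron: "0 * * * *" # every hour, on the hour
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install Ansible
        run: pip install ansible
      - name: Write the SSH key
        run: |
          install -m 600 /dev/null id_ed25519
          echo "${{ secrets.OCI1_SSH_KEY }}" > id_ed25519
      - name: Run the playbook
        run: >
          ansible-playbook -i inventory/oci1.yml playbooks/oci1.yml
          --private-key id_ed25519
```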
Future Expansion Plans
- I plan on setting up a WireGuard VPN, with an Oracle Cloud VPS as the primary connection endpoint.
- I want to set up a new Kubernetes cluster. Don’t let all those YAMLs from the past go to waste, after all!
History
This is the history of my attempts at system administration.
v0 - Early Forays
In late 2017, I was interested in trading Bitcoin. But not just going about it in a boring way; I wanted to algorithmically trade it, and use computers to calculate the optimal trading strategy. But I didn’t just want to calculate that strategy the normal way, either, I wanted to write a genetic algorithm to do it. And it didn’t just stop there, I wanted to run it on multiple computers to do parallel processing.
So, I spent $50 on a pair of Dell Optiplexes from eBay, hooked them into my LAN, installed Ubuntu Server on them, and tried to set up Apache Spark on them. What could possibly go wrong?
The issue was that, at the time, I was more of a programmer, and I saw deployment as an implementation detail that could be sorted out later. So, I did it the very quick and dirty way (wget the zip, call it via SSH) without really learning about modern best practices.
Well, what ended up happening was I never got any of that working. I didn’t know about tools like Ansible, Docker, or Nix at the time, and deploying in the prod environment over SSH was every bit as frustrating and tedious as I thought it would be. Additionally, it turns out you can’t just make a modern algotrader only relying on technical indicators, so that failed too. However, this experiment did set the stage for my future DevOps endeavors.
v1 - VPS Deployment
In late December 2020, I was writing the backend for astrid.tech, and I came across the problem of "how do I deploy this?" That's when I learned Docker. My VPS had excess capacity, so I hosted some other stuff besides my backend with it as well.
I consider this v1 of my homelab because it was something actually functional for a while. Although the service declarations were in a sort of modular Docker Compose architecture, they were all updated manually by SSHing in and essentially running `git pull && docker-compose up`.

Here is the last version of the configs before I incorporated it into the rest of my monorepo.
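For flavor, a service declaration in that style looked roughly like the sketch below. The image names, ports, and reverse proxy choice here are illustrative placeholders, not the actual v1 configs:

```yaml
# docker-compose.yml -- illustrative sketch of the v1 style, not the real configs
version: "3"
services:
  backend:
    image: ghcr.io/example/astrid-tech-backend:latest # placeholder image name
    restart: unless-stopped
    env_file: .env
    expose:
      - "8000"
  proxy:
    image: caddy:2 # stand-in for whatever reverse proxy actually fronted things
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile:ro
```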
v2 - On-Premises Cloud
Running a budgeting application like Firefly III in a public cloud alongside my website backend (a prime target for hacking!) seemed like a bit of a liability. So, I wanted to move that to a private, on-site cloud composed of multiple computers. It seemed awkward to manually allocate services to specific ones, so that led me to learn Kubernetes.
It was mostly set up manually, with a few Ansible scripts to semi-automate tasks like upgrading software, and a few Terraform configs to create databases. Here is what the infra repo looked like by the time I set up v2.
On-Site Hardware
Here is a listing of hardware I worked with:
Name | Model | Arch | Processor | Cores | RAM (GB) | Role |
---|---|---|---|---|---|---|
crappertop[^1] | Acer Aspire E1-510 | amd64 | Pentium N3520 | 4 | 4 | Proxmox: k3s, nfs |
cracktop[^2] | HP Pavilion x360 13 | amd64 | i5-6200U | 4 | 8 | Proxmox: k3s |
thonkpad[^3] | Thinkpad T420 | amd64 | i5-2520M | 4 | 8 | Proxmox: k3s, db, nfs |
zerg-1[^4] | Raspberry Pi 3B+ | arm | BCM2837B0 | 4 | 1 | k3s |
zerg-2[^4] | Raspberry Pi 3B | arm | BCM2837 | 4 | 1 | k3s |
zerg-3[^4] | Raspberry Pi 2B+ | arm | BCM2836 | 4 | 1 | k3s |
zerg-4[^4] | Orange Pi One | arm | sun8iw7p1 | 4 | 0.5 | k3s |
Total | — | — | — | 28 | 23.5 | Wasting my Time |
Public Cloud
I continued to use public cloud resources. I ran a Minecraft server on Contabo for a time, and I continued to run parts of the v1 stack on oci-1.
Name | Provider | Price ($/mo) | Arch | Processor | Cores | RAM (GB) | Role |
---|---|---|---|---|---|---|---|
contabo | Contabo | 7.90 | amd64 | Xeon something | 4 | 8 | Docker Compose |
oci-1 | Oracle | 0 | amd64 | Xeon something | 1 | 1 | Docker Compose |
oci-2 | Oracle | 0 | amd64 | Xeon something | 1 | 1 | Docker Compose |
Total | — | 7.90 | — | — | 6 | 10 | Wasting my Time |
Infrastructure Services
These services were deployed on the on-site hardware.
Name | Description | Deployed on |
---|---|---|
Proxmox | An open-source Type 1 Hypervisor OS | Bare metal |
K3s | A lightweight Kubernetes distribution that won’t eat up most of the resources on a Raspberry Pi or craptop | VM, bare metal Raspberry Pis |
MySQL/MariaDB | Database | LXC |
Postgres | Database | LXC |
Docker Compose | Multi-container applications stack, useful for servers dedicated to a single purpose | Bare metal |
NFS | File storage for specific Kubernetes services | LXC |
Samba | File and OS image storage | LXC |
End-User Services
These were the services I actually ran, plus some I planned on running but never got around to in v2.
Name | Status | Description | Deployed on |
---|---|---|---|
OctoPrint | Deployed | 3D Printer sender with web UI | Bare Metal |
Firefly III | Deployed | Budgeting App | k3s |
Printer Image Snapper | Deployed | Periodically takes pictures of my 3D Printer and uploads them to the internet | k3s |
D&D Dokuwiki | Deployed | A wiki for worldbuilding my D&D campaign | Docker Compose |
Trilium Notes | Deployed | Personal wiki/note-taking for school and more | k3s |
Apache Spark | Planned | Big Data processing engine | k3s |
Deluge | Planned | Torrenting server | k3s |
Jupyter Notebook | Planned | Interactive code notebooks | k3s |
Bookstack | Planned | “Internal” wiki for documenting this thing | k3s |
ELabFTW | Planned | Lab notes | k3s |
NextCloud | Planned | Personal cloud | k3s |
Of course, in Kubernetes, every service gets its own Ingress. However, these were internal services, so it seemed like somewhat of an antipattern to add A records pointing to 192.168.xxx.xxx in Cloudflare. My solution was just to add janky entries to my /etc/hosts and "fix it later" (which I eventually did, in v3 of my homelab):
192.168.1.xxx firefly.astrid.tech grafana.astrid.tech prometheus.astrid.tech ...
Monitoring Services
And finally, these were the services I used to monitor all of the above.
Name | Type | Description | Deployed on |
---|---|---|---|
Fluent-bit | Logs | Reads logs from each node’s containers | k3s, Docker Compose |
Fluentd | Logs | Centrally parses and processes logs | k3s |
Loki | Logs | Stores and indexes logs | k3s |
Prometheus | Metrics | Stores and indexes metrics | k3s |
Grafana | Visualization | Graphs and visualizes everything! | k3s |
Reflection
This was my first foray into both Kubernetes and homelabbing, so despite all the learning curves, I think I did great here! However, there were a lot of things that I could have done better with this setup.
- Ephemeral Kubernetes volumes. I didn't have any centralized storage, or any kind of sane storage management, so whenever a Kubernetes pod died, it would lose all its data.
- Mixed-architecture clustering is hard. You may notice I had both ARM and x86 machines. Many Docker images only support a single architecture, which makes scheduling workloads across a mixed cluster a constant headache. I do not recommend it.
- Low-end machines could not support virtualization. It was a stupid idea to run Proxmox on badtop with its Pentium and 4GB RAM.
- No domain controller. I wanted to set up FreeIPA, but I didn’t have the resources to do it.
v3 - Kubernetes-Focused Cloud
I decided to tear down my homelab and start anew, to fix all the issues I had with v2. This included reinstalling Proxmox.
This time, I had a similar stack, but with a few critical differences, making my Kubernetes setup less painful.
On-Site Hardware
I dropped all the ARM machines; mixed-architecture clustering is too hard. Additionally, for some machines, I installed k3s on bare metal.
Name | Model | Arch | Processor | Cores | RAM (GB) | Role |
---|---|---|---|---|---|---|
badtop | Acer Aspire E1-510 | amd64 | Pentium N3520 | 4 | 4 | Bare Metal: k3s |
cracktop | HP Pavilion x360 13 | amd64 | i5-6200U | 4 | 8 | Proxmox: k3s |
thonkpad | Thinkpad T420 | amd64 | i5-2520M | 4 | 8 | Proxmox: k3s, FreeIPA |
deskrap | Dell Optiplex 360 | amd64 | Intel Q???? | 4 | 3 | Bare Metal: k3s |
Infrastructure Services
Note that this time I actually managed DNS! This was done by having External DNS update FreeIPA's DNS server with entries for Kubernetes Ingresses. See this post for more information.
Name | Description | Deployed on |
---|---|---|
Proxmox | An open-source Type 1 Hypervisor OS | Bare metal |
K3s | A lightweight Kubernetes distribution | VM, bare metal |
FreeIPA | An all-in-one package for Identity, Policy, and Audit | VM |
Longhorn | A distributed storage solution for Kubernetes | Kubernetes |
KubeDB | A Kubernetes operator that manages databases | Kubernetes |
External DNS | Adds Kubernetes Ingress entries to Cloudflare and FreeIPA DNS | Kubernetes |
FreeIPA managed the `s.astrid.tech` and `p.astrid.tech` namespaces, where `s` stands for service and `p` stands for private. I would register FreeIPA clients on the `p` namespace, and internal services on the `s` namespace (like `longhorn.s.astrid.tech`, `firefly.s.astrid.tech`, `grafana.s.astrid.tech`…).
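To make that concrete, here's roughly what an Ingress looked like under this scheme. This is an illustrative sketch (the service name, port, and target IP are placeholders); the point is that External DNS watches objects like this and creates the matching record in FreeIPA's DNS:

```yaml
# Illustrative Ingress -- External DNS sees the host and target annotation
# and registers firefly.s.astrid.tech in FreeIPA's DNS.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: firefly
  annotations:
    external-dns.alpha.kubernetes.io/target: "192.168.1.50" # placeholder ingress IP
spec:
  rules:
    - host: firefly.s.astrid.tech
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: firefly-iii # placeholder service name
                port:
                  number: 8080
```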
End-User Services
Name | Status | Description | Deployed on |
---|---|---|---|
OctoPrint | Deployed | 3D Printer sender with web UI | Bare Metal |
Firefly III | Deployed | Budgeting App | k3s |
Printer Image Snapper | Deployed | Periodically takes pictures of my 3D Printer and uploads them to the internet | k3s |
ELabFTW | Deployed | Lab notes | k3s |
Homer | Deployed | Homepage linking together all of my services | k3s |
D&D Dokuwiki | Deployed | A wiki for worldbuilding my D&D campaign | k3s |
Matrix Homeserver | Planned | Self-hosted chat app | k3s |
Jellyfin | Planned | Media server | k3s |
Samba | Planned | A fileshare for all my files, it would connect to Jellyfin | LXC |
Deluge | Planned | Torrenting server | k3s |
Trilium Notes | Planned | Personal wiki/note-taking for school and more | k3s |
Jupyter Notebook | Planned | Interactive code notebooks | k3s |
Bookstack | Planned | “Internal” wiki for documenting this thing | k3s |
NextCloud | Planned | Personal cloud | k3s |
Apache Spark | Planned | Big Data processing engine | k3s |
Monitoring
I used the same exact stack as in v2 with minimal modifications.
Reflection
There were lots of good things about v3:
- Kubernetes storage was a lot simpler now that I had Longhorn (there's a sketch of what that looks like after this list).
- I ended up deploying more services and it was a lot nicer to deploy them.
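On the Longhorn point: with it installed, giving a pod durable storage is just a small PersistentVolumeClaim like the sketch below (the claim name and size are arbitrary examples), and the data survives pod restarts instead of vanishing like it did in v2.

```yaml
# Illustrative PVC -- the "longhorn" StorageClass provisions a replicated
# volume that outlives any single pod.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: firefly-data # arbitrary example name
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 2Gi
```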
However, there were some bad things as well:
- You may notice that a lot of services got dropped to "Planned". I intended to deploy them but never ended up doing it.
- The FreeIPA VM ate a lot of RAM. I didn’t have as much capacity to deploy my services as a result.
- Most of this was still manually set up. The Kubernetes cluster, for example, was manually set up using `k3sup`.
- My Oracle Cloud VPS was still deployed using the v1 setup!
v4 - Attempts at Fully-Automated Deployment
I tore down my homelab again. This time, I wanted to automate as much as possible. My primary goals were:
- to be able to set up all of the bare-metal machines by executing a single command, once their OSes are installed
- to create a more GitOps-like workflow where I can automatically deploy everything, simply by pushing my configs to `main`
Experiments with Ansible-based automated infrastructure bootstrapping
My first attempts revolved around a core idea: what if I had a central Ansible playbook that would do everything for me?
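The shape of that idea was roughly a single top-level playbook that fans out over every host group. The group and role names below are made up for illustration and don't reflect my actual repo layout:

```yaml
# site.yml -- illustrative top-level playbook; group and role names are made up
- hosts: hypervisors
  become: true
  roles:
    - common
    - proxmox

- hosts: k3s_servers
  become: true
  roles:
    - common
    - k3s_server

- hosts: k3s_agents
  become: true
  roles:
    - common
    - k3s_agent
```

In theory, a single `ansible-playbook -i inventory site.yml` would then converge the whole lab in one shot.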
To reduce configuration drift, I also attempted to set up Packer. The rest of this section is coming soon…
NixOS!
I don’t exactly remember when I first found out about Nix, but it was sometime last year. It seemed like an interesting concept, but I didn’t use it as anything more than a package manager with unstable/bleeding-edge packages. At some point, I wanted to distrohop BANANA (it was on Ubuntu at the time). Arch Linux and NixOS were my top two candidates to hop to.
Unfortunately, I had an extremely weird partitioning scheme involving Windows dual-boot, ext4, and a strange LVM topology, so I couldn't figure out how to configure Nix to work with it at the time. Additionally, I didn't want to spend much time learning the Nix language at that moment as I was lazy, and I was more interested in having a functional[^5] computer again. I ended up installing Arch, and it seems to mostly work!
However, while researching how to automatically bootstrap and update my cluster, I met Vika in the IndieWeb IRC. She told me about her Nix-based setup, and I realized that NixOS was perfect for what I was trying to do!
So, I turned my infra repo into a Nix flake, installed NixOS on Bongus, and that leads us to my current setup.
[^1]: My mom complained about it being really slow, and even with Linux, it's still slow. Thus, it's worse than crap.

[^2]: This is my old laptop. I pulled it out of my backpack one day and saw the screen completely cracked for no good reason whatsoever.

[^3]: :thinking: https://emoji.gg/assets/emoji/thonk.png

[^4]: Zerg rushing is essentially the "quantity-over-quality" tactic in strategy games: you send many expendable units against the enemy to hopefully overwhelm them.

[^5]: No, not like Nix functional, I mean working functional.