Building a Full Campus Infrastructure from Scratch
The Question
How do you manage hundreds of students logging into any computer and always finding their files?
That question got me curious. So I built a small-scale version to understand how it all comes together: 17 Ubuntu servers on DigitalOcean, configured entirely from scratch. No Docker. No managed services. Just Linux, config files, and systemd.
The Architecture
┌──────────────────────────────────────────────┐
│ Server Subnet (10.0.1.0/24) │
│ │
│ DNS ── LDAP ── NFS ── Ansible ── Gitea │
│ Prometheus ── Grafana ── Loki │
│ │
│ WireGuard Hub (infra-fw) │
└─────────────────┬────────────────────────────┘
│ encrypted VPN tunnels
┌─────────────────┴────────────────────────────┐
│ Workstation Subnets (10.1.x.0/24) │
│ │
│ Cluster 1 6 machines (Salle E1) │
│ Cluster 2 4 machines (Salle E2) │
└──────────────────────────────────────────────┘
7 infrastructure servers. 10 workstations. All connected through WireGuard VPN.
DNS: Everything Starts Here
DNS is the most underrated service in any infrastructure. Everything depends on it, yet nobody notices it until it breaks.
I set up BIND9 with a forward zone for tazi.lab and reverse zones for each subnet. Every machine resolves every other machine by name:
dns.tazi.lab → 10.0.1.1
ldap.tazi.lab → 10.0.1.2
nfs.tazi.lab → 10.0.1.3
ansible.tazi.lab → 10.0.1.4
monitor.tazi.lab → 10.0.1.5
e1r1p1.tazi.lab → 10.1.1.1
...
All 17 machines use this DNS server. External queries get forwarded to Cloudflare and Google.
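The forward zone behind those names can be sketched as a standard BIND zone file (serial and timer values here are illustrative):

```
; /etc/bind/db.tazi.lab — forward zone sketch
$TTL 86400
@   IN  SOA dns.tazi.lab. admin.tazi.lab. (
        2024010101  ; serial
        3600        ; refresh
        900         ; retry
        604800      ; expire
        86400 )     ; negative-cache TTL
@       IN  NS  dns.tazi.lab.
dns     IN  A   10.0.1.1
ldap    IN  A   10.0.1.2
nfs     IN  A   10.0.1.3
ansible IN  A   10.0.1.4
monitor IN  A   10.0.1.5
```

Forwarding external queries to Cloudflare and Google is then a one-line forwarders { 1.1.1.1; 8.8.8.8; }; block in named.conf.options.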
The classic interview question: "A student says internet is not working. What's the first thing you check?" DNS. Always DNS. 90% of "internet is broken" is actually "DNS is broken."
LDAP: One Account, Every Machine
Without centralized authentication, 200 students on 200 machines means 40,000 account creations. One password change means updating every machine. That's why LDAP exists.
I deployed OpenLDAP with the following structure:
dc=tazi,dc=lab
├── ou=People
│ ├── uid=yhakkache (student, uid=10042)
│ ├── uid=student01 (student, uid=10001)
│ ├── uid=bocal01 (staff, uid=5001)
│ └── uid=sysadmin (admin, uid=1001)
└── ou=Groups
├── cn=students (gid=1000)
├── cn=bocal (gid=500)
├── cn=admin (gid=100)
└── cn=piscine (gid=2000)
12 users across 4 groups, with SSSD on every workstation for authentication and credential caching. If LDAP goes down briefly during maintenance, recently authenticated users can still log in, unlike the old pam_ldap approach where a 30-second outage locks everyone out.
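The workstation side of that setup is a short sssd.conf. This is a minimal sketch, with TLS and access-control options omitted:

```
# /etc/sssd/sssd.conf — minimal sketch
[sssd]
services = nss, pam
domains = tazi.lab

[domain/tazi.lab]
id_provider = ldap
auth_provider = ldap
ldap_uri = ldap://ldap.tazi.lab
ldap_search_base = dc=tazi,dc=lab
cache_credentials = true    # lets recent users log in during an LDAP outage
```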
I also installed phpLDAPadmin for web-based account management.
NFS + AutoFS: Files Follow the User
Every student's home directory lives on a central NFS server. When they log in on any workstation, AutoFS mounts their home directory on demand and unmounts it after inactivity. No need to mount 200 directories on every machine at boot.
NFS Server (nfs.tazi.lab):
/nfs/home/yhakkache/ ← actual files
/nfs/home/student01/
Any Workstation:
/home/yhakkache/ ← mounted from NFS on login
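The on-demand behavior comes from a wildcard AutoFS map; roughly like this (export and mount options are illustrative):

```
# /etc/exports on nfs.tazi.lab
/nfs/home  10.0.0.0/8(rw,sync,no_subtree_check,root_squash)

# /etc/auto.master on every workstation
/home  /etc/auto.home  --timeout=300

# /etc/auto.home — the wildcard matches any username at login
*  -fstype=nfs,rw  nfs.tazi.lab:/nfs/home/&
```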
Disk Quotas
Without limits, one student can fill the entire disk. I set per-user quotas: 5 GB soft limit, 6 GB hard limit enforced by UID on the NFS server.
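Assuming the export lives on an ext4 volume mounted at /nfs (device name and paths here are illustrative), enforcement looks like:

```
# /etc/fstab on nfs.tazi.lab — usrquota turns on per-user accounting
/dev/sda1  /nfs  ext4  defaults,usrquota  0  2

# after quotacheck/quotaon: 5 GB soft, 6 GB hard, in 1 KiB blocks
setquota -u yhakkache 5242880 6291456 0 0 /nfs
```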
goinfre
Like in a real campus, each workstation has a local /goinfre directory for temporary large files that don't need to persist across machines.
The UID Problem
NFS uses UID numbers for permissions, not usernames. If LDAP says yhakkache = UID 10042, the files on NFS must be owned by UID 10042. A mismatch means Permission Denied. This is why LDAP and NFS must be perfectly synchronized.
WireGuard: Encrypted Communication
All 17 servers communicate through WireGuard VPN in a hub-and-spoke topology. The firewall server (infra-fw) acts as the hub, and all other machines connect through it.
Subnets:
Servers → 10.0.1.0/24
Cluster 1 → 10.1.1.0/24
Cluster 2 → 10.1.2.0/24
Internal services like LDAP (port 389) and NFS (port 2049) only listen on VPN addresses. They are not exposed to the public internet.
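A workstation's spoke config is only a few lines; a sketch with placeholder keys and endpoint:

```
# /etc/wireguard/wg0.conf on workstation e1r1p1 — keys and endpoint are placeholders
[Interface]
Address = 10.1.1.1/24
PrivateKey = <workstation-private-key>

[Peer]
# infra-fw, the hub
PublicKey = <hub-public-key>
Endpoint = <infra-fw-public-ip>:51820
AllowedIPs = 10.0.1.0/24, 10.1.0.0/16   # route all internal subnets through the hub
PersistentKeepalive = 25
```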
Gitea: Self-Hosted Git
Students need somewhere to push their code. Instead of depending on GitHub, I set up Gitea, a lightweight, self-hosted Git server. Students authenticate with their LDAP credentials and push their work to the internal infrastructure.
Ansible: Managing 17 Machines from One Node
Setting up SSSD + NFS + AutoFS on one workstation takes 10 minutes. Doing that by hand on 10 workstations takes most of a morning and invites configuration drift. That's where Ansible comes in.
From a single control node, I manage all 17 machines:
ansible all -m ping
→ 17/17 SUCCESS
Playbooks handle everything:
- Deploying node_exporter on all machines
- Deploying promtail for log collection
- Deploying fail2ban for SSH protection
- Rebooting workstations safely
- Installing packages across the fleet
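A deployment playbook from that list might look like this minimal sketch (Ubuntu's packaged exporter assumed; file and task names are illustrative):

```yaml
# deploy_node_exporter.yml — minimal sketch
- name: Deploy node_exporter on every machine
  hosts: all
  become: true
  tasks:
    - name: Install the exporter
      apt:
        name: prometheus-node-exporter
        state: present
        update_cache: true

    - name: Start it and enable it at boot
      service:
        name: prometheus-node-exporter
        state: started
        enabled: true
```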
Ansible playbooks are documentation that executes. The next sysadmin reads the playbook and knows exactly how everything was set up.
Prometheus + Grafana: See Everything
Prometheus scrapes metrics from node_exporter on all 17 machines every 15 seconds. Grafana turns those numbers into dashboards.
18 targets monitored (17 machines + Prometheus itself). All UP.
Alert rules fire when:
- A machine goes offline
- CPU stays above 90%
- Disk usage crosses 85%
- Memory is critically low
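Two of those conditions, written as Prometheus alerting rules (thresholds from above; the for: durations are assumptions):

```yaml
# alerts.yml — sketch
groups:
  - name: infra
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical

      - alert: DiskAlmostFull
        # fires when less than 15% of the root filesystem is free
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.15
        for: 10m
        labels:
          severity: warning
```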
I went from reactive ("something broke, let me investigate") to proactive ("disk is at 85%, let me fix it before it hits 100%").
Loki + Promtail: Centralized Logging
Promtail runs on all 17 machines, shipping syslog, auth.log, and kern.log to Loki on the monitoring server.
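Each machine's Promtail config is short; a sketch, with the Loki address and ports assumed:

```yaml
# /etc/promtail/config.yml — minimal sketch
server:
  http_listen_port: 9080

clients:
  - url: http://monitor.tazi.lab:3100/loki/api/v1/push

scrape_configs:
  - job_name: auth
    static_configs:
      - targets: [localhost]
        labels:
          job: auth
          __path__: /var/log/auth.log
```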
This is where things got real. When I queried:
{job="auth"} |= "Failed password"
I found thousands of real SSH brute-force attempts from IPs around the world: China, Russia, Brazil, Vietnam. These were not simulated attacks. The servers have public IPs on DigitalOcean, and port 22 is open for remote management.
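LogQL can go beyond filtering. A query like this (the pattern expression assumes the standard sshd log line format) counts failed attempts per source IP over a day:

```
sum by (ip) (
  count_over_time(
    {job="auth"} |= "Failed password"
      | pattern `<_> from <ip> port <_>`
    [24h]
  )
)
```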
Loki didn't just collect logs; it revealed a real security problem.
fail2ban: The Response
After discovering the brute-force attacks through Loki, I deployed fail2ban across all 17 machines using Ansible:
- 3 failed attempts → IP banned for 1 hour
- Scans auth.log in real time
- Deployed in minutes with one playbook
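That policy maps directly onto a small jail.local fragment (the findtime value here is an assumption):

```
# /etc/fail2ban/jail.local — deployed by the playbook
[sshd]
enabled  = true
maxretry = 3
findtime = 10m
bantime  = 1h
logpath  = /var/log/auth.log
```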
This is the difference between cloud and on-premise infrastructure. In a physical campus behind a firewall, SSH isn't exposed to the internet. In the cloud, every server has a public IP. fail2ban protects that public SSH layer.
What Happens When a Student Logs In
Here's the full flow everything working together:
- Student types username + password on any workstation
- The workstation resolves ldap.tazi.lab via DNS
- SSSD authenticates against OpenLDAP
- AutoFS mounts their home directory from NFS
- Files are there, same as on any other machine
- Promtail logs the event, Prometheus records the metrics
- Disk quotas enforce the storage limit
Next day, different machine, same files. One user, any workstation, zero setup.
What I Learned
Everything depends on everything
DNS goes down → LDAP can't be found → authentication fails → NFS can't mount → students can't work. One broken service cascades into total failure. Infrastructure is about dependency graphs, not individual tools.
Automation isn't optional
I configured SSSD manually on one workstation in 10 minutes. Then I wrote an Ansible playbook and configured 10 workstations in 2 minutes. At scale, manual configuration isn't just slow; it's impossible.
Monitoring changes how you think
Without Prometheus, checking a server's health meant SSH-ing in. With Grafana, one dashboard shows the entire infrastructure. Alerting rules turn you from firefighter to engineer.
Logs tell the truth
Setting up Loki felt like a checkbox exercise until it showed me real attacks happening in real time. Centralized logging isn't about compliance. It's about knowing what's actually happening on your machines.
Security is layers
WireGuard protects internal services. fail2ban protects public SSH. Disk quotas prevent abuse. Monitoring detects anomalies. No single measure protects everything; each layer covers the others' blind spots.
The best infrastructure is invisible
When everything works, nobody notices. Students sit down, log in, code, log out. They never think about DNS, LDAP, NFS, or VPN. That's success.
The Stack
| Service | Tool | Purpose |
|---|---|---|
| DNS | BIND9 | Internal name resolution |
| Authentication | OpenLDAP + SSSD | Centralized user accounts |
| File Sharing | NFS + AutoFS | Home directories across workstations |
| VPN | WireGuard | Encrypted server communication |
| Git | Gitea | Self-hosted code repositories |
| Automation | Ansible | Configuration management |
| Monitoring | Prometheus + Grafana | Metrics and dashboards |
| Logging | Loki + Promtail | Centralized log collection |
| Security | fail2ban | SSH brute-force protection |
17 servers. 9 services. Zero managed services. Built from scratch to understand how it all connects.
Building something from scratch is still the best way to actually understand it.