Building a Full Campus Infrastructure from Scratch
The Question
How do you manage hundreds of students logging into any computer and always finding their files?
That question got me curious. So I built a small-scale version to understand how it all comes together: 17 Ubuntu servers on DigitalOcean, configured entirely from scratch. No Docker. No managed services. Just Linux, config files, and systemd.
The Architecture
┌──────────────────────────────────────────────┐
│ Server Subnet (10.0.1.0/24) │
│ │
│ DNS ── LDAP ── NFS ── Ansible ── Gitea │
│ Prometheus ── Grafana ── Loki │
│ │
│ WireGuard Hub (infra-fw) │
└─────────────────┬────────────────────────────┘
│ encrypted VPN tunnels
┌─────────────────┴────────────────────────────┐
│ Workstation Subnets (10.1.x.0/24) │
│ │
│ Cluster 1 6 machines (Salle E1) │
│ Cluster 2 4 machines (Salle E2) │
└──────────────────────────────────────────────┘
7 infrastructure servers. 10 workstations. All connected through WireGuard VPN.
DNS: Everything Starts Here
DNS is the most underrated service in any infrastructure. Everything depends on it, yet nobody notices it until it breaks.
I set up BIND9 with a forward zone for tazi.lab and reverse zones for each subnet. Every machine resolves every other machine by name:
dns.tazi.lab → 10.0.1.1
ldap.tazi.lab → 10.0.1.2
nfs.tazi.lab → 10.0.1.3
ansible.tazi.lab → 10.0.1.4
monitor.tazi.lab → 10.0.1.5
e1r1p1.tazi.lab → 10.1.1.1
...
All 17 machines use this DNS server. External queries get forwarded to Cloudflare and Google.
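The forward zone behind those names can be sketched as a standard BIND zone file (serial and timer values here are illustrative):

```
; /etc/bind/db.tazi.lab — forward zone sketch
$TTL 86400
@   IN  SOA dns.tazi.lab. admin.tazi.lab. (
        2024010101  ; serial
        3600        ; refresh
        900         ; retry
        604800      ; expire
        86400 )     ; negative-cache TTL
@       IN  NS  dns.tazi.lab.
dns     IN  A   10.0.1.1
ldap    IN  A   10.0.1.2
nfs     IN  A   10.0.1.3
ansible IN  A   10.0.1.4
monitor IN  A   10.0.1.5
```

Forwarding external queries to Cloudflare and Google is then a one-line forwarders { 1.1.1.1; 8.8.8.8; }; block in named.conf.options.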
The classic interview question: "A student says internet is not working. What's the first thing you check?" DNS. Always DNS. 90% of "internet is broken" is actually "DNS is broken."
LDAP: One Account, Every Machine
Without centralized authentication, 200 students on 200 machines means 40,000 account creations. One password change means updating every machine. That's why LDAP exists.
I deployed OpenLDAP with the following structure:
dc=tazi,dc=lab
├── ou=People
│ ├── uid=yhakkache (student, uid=10042)
│ ├── uid=student01 (student, uid=10001)
│ ├── uid=bocal01 (staff, uid=5001)
│ └── uid=sysadmin (admin, uid=1001)
└── ou=Groups
├── cn=students (gid=1000)
├── cn=bocal (gid=500)
├── cn=admin (gid=100)
└── cn=piscine (gid=2000)
12 users across 4 groups, with SSSD on every workstation for authentication and credential caching. If LDAP goes down briefly during maintenance, recently authenticated users can still log in, unlike the old pam_ldap approach where a 30-second outage locks everyone out.
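The workstation side of that setup is a short sssd.conf. This is a minimal sketch, with TLS and access-control options omitted:

```
# /etc/sssd/sssd.conf — minimal sketch
[sssd]
services = nss, pam
domains = tazi.lab

[domain/tazi.lab]
id_provider = ldap
auth_provider = ldap
ldap_uri = ldap://ldap.tazi.lab
ldap_search_base = dc=tazi,dc=lab
cache_credentials = true    # lets recent users log in during an LDAP outage
```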
I also installed phpLDAPadmin for web-based account management.
NFS + AutoFS: Files Follow the User
Every student's home directory lives on a central NFS server. When they log in on any workstation, AutoFS mounts their home directory on demand and unmounts it after inactivity. No need to mount 200 directories on every machine at boot.
NFS Server (nfs.tazi.lab):
/nfs/home/yhakkache/ ← actual files
/nfs/home/student01/
Any Workstation:
/home/yhakkache/ ← mounted from NFS on login
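The on-demand behavior comes from a wildcard AutoFS map; roughly like this (export and mount options are illustrative):

```
# /etc/exports on nfs.tazi.lab
/nfs/home  10.0.0.0/8(rw,sync,no_subtree_check,root_squash)

# /etc/auto.master on every workstation
/home  /etc/auto.home  --timeout=300

# /etc/auto.home — the wildcard matches any username at login
*  -fstype=nfs,rw  nfs.tazi.lab:/nfs/home/&
```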
Disk Quotas
Without limits, one student can fill the entire disk. I set per-user quotas: 5 GB soft limit, 6 GB hard limit enforced by UID on the NFS server.
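Assuming the export lives on an ext4 volume mounted at /nfs (device name and paths here are illustrative), enforcement looks like:

```
# /etc/fstab on nfs.tazi.lab — usrquota turns on per-user accounting
/dev/sda1  /nfs  ext4  defaults,usrquota  0  2

# after quotacheck/quotaon: 5 GB soft, 6 GB hard, in 1 KiB blocks
setquota -u yhakkache 5242880 6291456 0 0 /nfs
```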
goinfre
Like in a real campus, each workstation has a local /goinfre directory for temporary large files that don't need to persist across machines.
The UID Problem
NFS uses UID numbers for permissions, not usernames. If LDAP says yhakkache = UID 10042, the files on NFS must be owned by UID 10042. A mismatch means Permission Denied. This is why LDAP and NFS must be perfectly synchronized.
WireGuard: Encrypted Communication
All 17 servers communicate through WireGuard VPN in a hub-and-spoke topology. The firewall server (infra-fw) acts as the hub, and all other machines connect through it.
Subnets:
Servers → 10.0.1.0/24
Cluster 1 → 10.1.1.0/24
Cluster 2 → 10.1.2.0/24
Internal services like LDAP (port 389) and NFS (port 2049) only listen on VPN addresses. They are not exposed to the public internet.
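A workstation's spoke config is only a few lines; a sketch with placeholder keys and endpoint:

```
# /etc/wireguard/wg0.conf on workstation e1r1p1 — keys and endpoint are placeholders
[Interface]
Address = 10.1.1.1/24
PrivateKey = <workstation-private-key>

[Peer]
# infra-fw, the hub
PublicKey = <hub-public-key>
Endpoint = <infra-fw-public-ip>:51820
AllowedIPs = 10.0.1.0/24, 10.1.0.0/16   # route all internal subnets through the hub
PersistentKeepalive = 25
```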
Gitea: Self-Hosted Git
Students need somewhere to push their code. Instead of depending on GitHub, I set up Gitea, a lightweight, self-hosted Git server. Students authenticate with their LDAP credentials and push their work to the internal infrastructure.
Ansible: Managing 17 Machines from One Node
Setting up SSSD + NFS + AutoFS on one workstation takes 10 minutes. Doing that by hand on 10 workstations takes most of a morning and invites configuration drift. That's where Ansible comes in.
From a single control node, I manage all 17 machines:
ansible all -m ping
→ 17/17 SUCCESS
Playbooks handle everything:
- Deploying node_exporter on all machines
- Deploying promtail for log collection
- Deploying fail2ban for SSH protection
- Rebooting workstations safely
- Installing packages across the fleet
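A deployment playbook from that list might look like this minimal sketch (Ubuntu's packaged exporter assumed; file and task names are illustrative):

```yaml
# deploy_node_exporter.yml — minimal sketch
- name: Deploy node_exporter on every machine
  hosts: all
  become: true
  tasks:
    - name: Install the exporter
      apt:
        name: prometheus-node-exporter
        state: present
        update_cache: true

    - name: Start it and enable it at boot
      service:
        name: prometheus-node-exporter
        state: started
        enabled: true
```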
Ansible playbooks are documentation that executes. The next sysadmin reads the playbook and knows exactly how everything was set up.
Prometheus + Grafana: See Everything
Prometheus scrapes metrics from node_exporter on all 17 machines every 15 seconds. Grafana turns those numbers into dashboards.
18 targets monitored (17 machines + Prometheus itself). All UP.
Alert rules fire when:
- A machine goes offline
- CPU stays above 90%
- Disk usage crosses 85%
- Memory is critically low
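Two of those conditions, written as Prometheus alerting rules (thresholds from above; the for: durations are assumptions):

```yaml
# alerts.yml — sketch
groups:
  - name: infra
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical

      - alert: DiskAlmostFull
        # fires when less than 15% of the root filesystem is free
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.15
        for: 10m
        labels:
          severity: warning
```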
I went from reactive ("something broke, let me investigate") to proactive ("disk is at 85%, let me fix it before it hits 100%").
Loki + Promtail: Centralized Logging
Promtail runs on all 17 machines, shipping syslog, auth.log, and kern.log to Loki on the monitoring server.
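Each machine's Promtail config is short; a sketch, with the Loki address and ports assumed:

```yaml
# /etc/promtail/config.yml — minimal sketch
server:
  http_listen_port: 9080

clients:
  - url: http://monitor.tazi.lab:3100/loki/api/v1/push

scrape_configs:
  - job_name: auth
    static_configs:
      - targets: [localhost]
        labels:
          job: auth
          __path__: /var/log/auth.log
```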
This is where things got real. When I queried:
{job="auth"} |= "Failed password"
I found thousands of real SSH brute-force attempts from IPs around the world: China, Russia, Brazil, Vietnam. These were not simulated attacks. The servers have public IPs on DigitalOcean, and port 22 is open for remote management.
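LogQL can go beyond filtering. A query like this (the pattern expression assumes the standard sshd log line format) counts failed attempts per source IP over a day:

```
sum by (ip) (
  count_over_time(
    {job="auth"} |= "Failed password"
      | pattern `<_> from <ip> port <_>`
    [24h]
  )
)
```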
Loki didn't just collect logs; it revealed a real security problem.
fail2ban: The Response
After discovering the brute-force attacks through Loki, I deployed fail2ban across all 17 machines using Ansible:
- 3 failed attempts → IP banned for 1 hour
- Scans auth.log in real time
- Deployed in minutes with one playbook
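That policy maps directly onto a small jail.local fragment (the findtime value here is an assumption):

```
# /etc/fail2ban/jail.local — deployed by the playbook
[sshd]
enabled  = true
maxretry = 3
findtime = 10m
bantime  = 1h
logpath  = /var/log/auth.log
```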
This is the difference between cloud and on-premise infrastructure. In a physical campus behind a firewall, SSH isn't exposed to the internet. In the cloud, every server has a public IP. fail2ban protects that public SSH layer.
What Happens When a Student Logs In
Here's the full flow everything working together:
- Student types username + password on any workstation
- The workstation resolves ldap.tazi.lab via DNS
- SSSD authenticates against OpenLDAP
- AutoFS mounts their home directory from NFS
- Files are there, same as on any other machine
- Promtail logs the event, Prometheus records the metrics
- Disk quotas enforce the storage limit
Next day, different machine, same files. One user, any workstation, zero setup.
What I Learned
Everything depends on everything
DNS goes down → LDAP can't be found → authentication fails → NFS can't mount → students can't work. One broken service cascades into total failure. Infrastructure is about dependency graphs, not individual tools.
Automation isn't optional
I configured SSSD manually on one workstation in 10 minutes. Then I wrote an Ansible playbook and configured 10 workstations in 2 minutes. At scale, manual configuration isn't just slow; it's impossible.
Monitoring changes how you think
Without Prometheus, checking a server's health meant SSH-ing in. With Grafana, one dashboard shows the entire infrastructure. Alerting rules turn you from firefighter to engineer.
Logs tell the truth
Setting up Loki felt like a checkbox exercise until it showed me real attacks happening in real time. Centralized logging isn't about compliance. It's about knowing what's actually happening on your machines.
Security is layers
WireGuard protects internal services. fail2ban protects public SSH. Disk quotas prevent abuse. Monitoring detects anomalies. No single measure protects everything; each layer covers the others' blind spots.
The best infrastructure is invisible
When everything works, nobody notices. Students sit down, log in, code, log out. They never think about DNS, LDAP, NFS, or VPN. That's success.
The Stack
| Service | Tool | Purpose |
|---|---|---|
| DNS | BIND9 | Internal name resolution |
| Authentication | OpenLDAP + SSSD | Centralized user accounts |
| File Sharing | NFS + AutoFS | Home directories across workstations |
| VPN | WireGuard | Encrypted server communication |
| Git | Gitea | Self-hosted code repositories |
| Automation | Ansible | Configuration management |
| Monitoring | Prometheus + Grafana | Metrics and dashboards |
| Logging | Loki + Promtail | Centralized log collection |
| Security | fail2ban | SSH brute-force protection |
17 servers. 9 services. Zero managed services. Built from scratch to understand how it all connects.
Building something from scratch is still the best way to actually understand it.