Add network boot testing and monitoring documentation
Enhance the NETWORK-BOOT-LXC.md documentation with detailed steps for testing network boot functionality, including prerequisites, expected outcomes, and quick testing methods. Introduce a new section on monitoring network boot status on the LXC, outlining commands to check DHCP leases, dnsmasq status, and registered devices. Update the initramfs scripts to support a rescue mode for devices stuck in network-only boot, allowing users to change boot order settings. Include a new rescue script for eMMC management in the build process.
@@ -97,6 +97,47 @@ cat /var/lib/misc/dnsmasq.leases

Each line is: *expiry_epoch MAC IP hostname client_id*. Example: `1734567890 aa:bb:cc:dd:ee:ff 10.20.50.101 reterminal 01:aa:bb:cc:dd:ee:ff`

---

## Testing network boot

1. **Prerequisites**
   - reTerminal has **BOOT_ORDER=0x12** (network first, then eMMC). The bootloader reads the digits right to left, so `0x21` tries eMMC first and only falls back to network if that fails. Check on the device:
     `ssh pi@<device-ip> 'bash -s' < emmc-provisioning/scripts/check-network-boot-priority.sh`
   - LXC network-boot options are **enabled**: on the LXC run
     `/opt/cm4-provisioning/toggle-network-boot-dhcp.sh status` → should print `enabled`. If not: `toggle-network-boot-dhcp.sh enable`
   - reTerminal is on the **same LAN as the LXC’s eth1** (e.g. 10.20.50.0/24), Ethernet connected.

2. **Power cycle the reTerminal** (or reboot if it’s already running). It will request DHCP, get options 66/67 (TFTP server + boot file), then fetch its boot files from the LXC over TFTP.

3. **What “working” looks like**
   - **On the LXC**: a new lease appears in `/var/lib/misc/dnsmasq.leases` (device MAC + IP in 10.20.50.x).
   - If the netboot environment runs **provisioning-client.sh** and registers with the dashboard: the device appears under **“Device detected (Network)”** on the dashboard (`http://<LXC-IP>:5000`), and you can choose Backup/Deploy.
   - If you only use “plain” Pi netboot (no custom initramfs/provisioning client): you just see the DHCP lease and the device loading files via TFTP; it may boot to a minimal kernel/initramfs or NFS root depending on your TFTP config.

4. **Quick test without a reTerminal**
   - From a Linux host on the same VLAN as eth1, run:
     `sudo dhclient -v eth0` (or your interface) and check that you get an IP in 10.20.50.x and, if netboot is enabled, that the DHCP reply includes option 66 (TFTP server) and option 67 (boot file).
   - Or on the LXC run `tcpdump -i eth1 -n port 67 or port 68` and power on the reTerminal: you should see DHCP (Discover/Offer/Request/Ack). Add `or port 69` to the filter to also see the TFTP requests that follow.
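
The lease check from step 3 can also be scripted. The following is a minimal sketch, assuming the lease-file layout described earlier (expiry, MAC, IP, hostname, client ID); the function name `lease_for_mac` is made up for this example and is not part of the provisioning scripts. On the LXC it would be called as `lease_for_mac aa:bb:cc:dd:ee:ff`.

```bash
# Print "IP hostname" for a MAC address found in a dnsmasq lease file.
# Usage: lease_for_mac <mac> [lease-file]
# Returns non-zero when the MAC has no lease.
lease_for_mac() {
    local mac file
    mac=$(printf '%s' "$1" | tr 'A-F' 'a-f')        # compare case-insensitively
    file=${2:-/var/lib/misc/dnsmasq.leases}          # default path used in this doc
    awk -v mac="$mac" '
        tolower($2) == mac { print $3, $4; found = 1 }
        END { exit found ? 0 : 1 }
    ' "$file"
}
```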

---

## Monitoring on the LXC

| What to check | How |
|--------------|-----|
| **Network boot enabled?** | `/opt/cm4-provisioning/toggle-network-boot-dhcp.sh status` → `enabled` or `disabled` |
| **DHCP leases** | `cat /var/lib/misc/dnsmasq.leases`: lists MAC, IP, hostname for devices that got an IP from dnsmasq on eth1 |
| **dnsmasq (DHCP/TFTP) running** | `systemctl status dnsmasq` or `service dnsmasq status` |
| **TFTP root present** | `ls -la /srv/tftpboot/`: should contain e.g. `start4cd.elf`, `fixup4cd.dat`, `config.txt`, `kernel8.img` |
| **Live DHCP/TFTP traffic** | `tcpdump -i eth1 -n port 67 or port 68 or port 69` (67/68 = DHCP, 69 = TFTP). Run while powering on a device. |
| **Dashboard: network devices** | Open `http://<LXC-IP>:5000`; under “Device detected (Network)” you see devices that have called `POST /api/register-device` (only if your netboot environment runs the provisioning client). |
| **Registered devices (raw)** | `cat /var/lib/cm4-provisioning/network_devices.json` (if the dashboard uses the default path): list of MAC, IP, action. |
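
Optional: enable extra dnsmasq logging to see every DHCP request. Add to a config in `/etc/dnsmasq.d/` (e.g. `logging.conf`): `log-dhcp` (logs DHCP transactions and the options sent; `log-queries` only covers DNS queries) and `log-facility=/var/log/dnsmasq.log`, then create the log file and run `systemctl restart dnsmasq` (dnsmasq does not re-read its configuration on a reload/SIGHUP). Check your distro’s dnsmasq documentation for the default log location.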
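
The optional logging setup can be written as a small helper. This is a sketch under assumptions: the function name `write_dnsmasq_logging` and the file name `logging.conf` are illustrative, and it uses `log-dhcp`, dnsmasq's option for verbose DHCP logging. On the LXC it would be run as `write_dnsmasq_logging /etc/dnsmasq.d /var/log/dnsmasq.log`, followed by `systemctl restart dnsmasq`.

```bash
# Write a dnsmasq logging fragment into a conf directory.
# Usage: write_dnsmasq_logging <conf-dir> <log-file>
write_dnsmasq_logging() {
    local conf_dir=$1 log_file=$2
    cat > "$conf_dir/logging.conf" <<EOF
# Log DHCP transactions, including the options sent to each client
log-dhcp
log-facility=$log_file
EOF
    # Create the log file up front, as suggested in the text above
    touch "$log_file"
}
```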

---

## Summary

| Component | Where | Purpose |
195
emmc-provisioning/docs/NETWORK-BOOT-TROUBLESHOOTING.md
Normal file
@@ -0,0 +1,195 @@
# Network boot troubleshooting: no DHCP/TFTP during boot, only after OS is up

If you run **tcpdump** during power-on but see **no DHCP/TFTP traffic during boot**, and only see traffic **after** the device has booted to the OS, the reTerminal is almost certainly **not on the same L2 segment as the LXC's eth1**.

## What’s going on

- The Pi’s **bootloader** (EEPROM) sends DHCP Discover on the Ethernet port when it tries network boot.
- That request only reaches interfaces on the **same VLAN / same bridge** (same cable/switch segment).
- dnsmasq in the LXC listens only on **eth1** (provisioning LAN).
- If the reTerminal is plugged into the **main office LAN** (or the same segment as the LXC’s **eth0**), the netboot DHCP **never reaches eth1**, so you see no DHCP/TFTP on eth1 during boot.
- After the OS boots, it uses the same Ethernet port and gets an IP from the main LAN; you then see traffic (e.g. on eth0 or from the device’s new IP). That’s why you only see traffic “after the device boots to OS”.

## What to do

### 1. Confirm which interface sees the boot-time DHCP

On the LXC, run tcpdump on **both** interfaces in two terminals (or run one in background):

```bash
# Terminal 1: provisioning LAN (where netboot should happen)
tcpdump -i eth1 -n -e port 67 or port 68 or port 69

# Terminal 2: WAN / main LAN
tcpdump -i eth0 -n -e port 67 or port 68 or port 69
```

Then **power off** the reTerminal and **power it on**. Watch where DHCP (and TFTP) appear:

- If you see DHCP **only on eth0** during boot → the reTerminal is on the same segment as **eth0**, not eth1. So netboot is not using your LXC’s dnsmasq; the device may get an IP from another DHCP server and fall back to eMMC boot.
- If you see DHCP **on eth1** during boot → the reTerminal is on the provisioning segment; you should then see TFTP (port 69) as well.

### 2. Fix: put the reTerminal on the same segment as eth1

- The reTerminal’s Ethernet cable must be connected to the **provisioning** segment: the same VLAN or bridge as the LXC’s **eth1** (e.g. 10.20.50.0/24).
- On Proxmox, eth1 is often on a **dedicated bridge** (e.g. `vmbr1`). The reTerminal must be plugged into a switch port that belongs to that same bridge/VLAN.
- If you have one physical switch: either put the LXC’s eth1 and the reTerminal in the same VLAN, or use a dedicated “provisioning” port group / switch.

### 3. Sanity check: same port as reTerminal

- Plug a **laptop** (or another device) into the **same port** (or same VLAN) as the reTerminal.
- Run: `sudo dhclient -v <interface>` (or let it get DHCP automatically).
- If you get an IP in **10.20.50.x** → that segment is your provisioning LAN (eth1); the reTerminal should netboot from there.
- If you get a different range (e.g. 192.168.x.x) → that segment is **not** the provisioning LAN; move the reTerminal’s cable or VLAN to the segment where 10.20.50.x is served.
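
The last two bullets come down to a prefix check on the address the laptop received. A minimal sketch, assuming the example provisioning range 10.20.50.0/24 used throughout this document; `on_provisioning_lan` is a hypothetical helper name:

```bash
# Succeed if an IPv4 address is inside the provisioning /24.
# Usage: on_provisioning_lan <ip> [first-three-octets]
on_provisioning_lan() {
    local ip=$1 prefix=${2:-10.20.50}
    case "$ip" in
        "$prefix".*) return 0 ;;   # e.g. 10.20.50.147
        *)           return 1 ;;   # any other range, e.g. 192.168.x.x
    esac
}
```

For example, `on_provisioning_lan 10.20.50.147 && echo "provisioning LAN"` prints the message, while an address such as 192.168.1.20 makes the function return non-zero.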

## Summary table

| Symptom | Likely cause | Action |
|--------|---------------|--------|
| No DHCP/TFTP on eth1 during boot; traffic only after OS | reTerminal on different segment than eth1 | Plug reTerminal into same VLAN/bridge as LXC eth1 (provisioning LAN) |
| DHCP on eth0 during boot, none on eth1 | reTerminal on same segment as eth0 | Move reTerminal to provisioning segment (same as eth1) |
| No DHCP on any interface during boot | Cable unplugged, BOOT_ORDER does not try network first, or device not attempting netboot | Check cable, confirm BOOT_ORDER=0x12 (digits are read right to left, so `0x21` tries eMMC first), power cycle with cable in before power |

---

## I only see DHCP Request/Reply, and the client already has 10.20.50.x

If your tcpdump on **eth1** shows something like:

```text
0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 88:a2:9e:xx:xx:xx
10.20.50.1.67 > 10.20.50.147.68: BOOTP/DHCP, Reply
```

that does **not** by itself prove you are looking at the bootloader. Without `-v`, tcpdump prints every client message (Discover as well as Request) as `Request` and every server message (Offer or Ack) as `Reply`, and the reply is addressed to the IP being handed out, so add `-v` to see the actual DHCP message type. If this exchange only shows up once the OS is running and no TFTP follows it, it is the **OS** DHCP client (renewal or re-request for **10.20.50.147**), not the bootloader.

- **Bootloader** (network boot): sends **DHCP Discover** (client 0.0.0.0, no IP yet), then you see **Offer**, **Request**, **Ack**, then **TFTP (port 69)** for start4cd.elf, kernel, etc.
- **OS**: typically sends a **DHCP Request** for a lease it already knows (renew, rebind or reboot) and gets an **Ack**; usually no Discover, and never TFTP.

So the device **is** on the right segment (eth1, 10.20.50.x). The problem is that you are not seeing the **bootloader’s** DHCP/TFTP during the first seconds after power-on.

**What to do:**

1. **Start tcpdump before power-on**
   Run `tcpdump -i eth1 -n -e -v port 67 or port 68 or port 69` on the LXC, **then** power off the reTerminal, wait a few seconds, and power it on. Capture from the first second. Look for:
   - **Discover** (client 0.0.0.0 → broadcast) at the very start → that’s the bootloader.
   - **TFTP (port 69)** right after DHCP Ack → bootloader loading files.
2. If you **never** see Discover or TFTP, only Request/Ack after the OS is up, then the bootloader is either not attempting network boot or is giving up (e.g. link not ready, timeout) and booting from eMMC. Try a full power-off (mains or PSU), wait 10 s, then power on with tcpdump already running.
3. Confirm that **BOOT_ORDER** tries network first (`0x12`; the digits are read right to left, so `0x21` tries eMMC first) and that Ethernet is connected before power-on.
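
Since TFTP is the clearest sign of the bootloader, a saved capture can be checked for TFTP read requests. This is a sketch that assumes tcpdump's text output marks them as `RRQ "<file>"` (the exact line layout varies between tcpdump versions); `netboot_seen` is a hypothetical helper, fed with output saved via `tcpdump -l -i eth1 -n port 67 or port 68 or port 69 > boot.txt`.

```bash
# Report whether a saved tcpdump text capture contains TFTP read requests.
# Usage: netboot_seen <capture-file>
# Returns non-zero when no read request is found.
netboot_seen() {
    local capture=$1
    if grep -q 'RRQ "' "$capture"; then
        echo "TFTP read requests seen (bootloader is fetching files):"
        # List each requested file once
        grep -o 'RRQ "[^"]*"' "$capture" | sort -u
    else
        echo "no TFTP read requests in capture"
        return 1
    fi
}
```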

---

## reTerminal DM: serial console vs USB boot (rpiboot)

**The serial console is not on the same USB as rpiboot.**

| Port / interface | Purpose |
|------------------|--------|
| **USB Type-C** (next to boot-mode switch) | Power, and **rpiboot** when eMMC is disabled (USB device mode). No serial console here. |
| **40-pin GPIO header** (UART) | **Serial console.** Use a 3.3 V USB‑to‑serial adapter; connect its **RX** to **GPIO 14 (TXD, Pin 8)**, its **TX** to **GPIO 15 (RXD, Pin 10)**, and **GND** to a ground pin (e.g. Pin 6). |

**Baud rate:**

- **Bootloader (BOOT_UART=1):** use **115200** 8N1. This is the Pi EEPROM/bootloader debug output (network boot attempts, DHCP, TFTP, errors).
- **OS serial login:** some Seeed docs use **9600** for getty; many Pi images use **115200**. If you only care about bootloader messages, use **115200**.

So: use the **USB‑C cable** only for power and rpiboot. For serial console, use a **USB‑to‑serial adapter** on the **GPIO header** at **115200** to see bootloader output.

---

## Serial shows "Boot mode: SD (01)" and no network attempt

If the bootloader serial output shows something like:

```text
Boot mode: SD (01) order 2
```

and you **never** see a line about network (e.g. "Trying DHCP", "TFTP", or "Boot mode: NET (02)"), then the bootloader is **not** attempting network boot for this boot. It goes straight to SD/eMMC (01). That matches “no DHCP during boot, only after OS”.

**Possible causes:**

1. **BOOT_ORDER not applied, not read, or listing eMMC before network**
   From the running OS, confirm:
   `sudo vcgencmd bootloader_config`
   and check `BOOT_ORDER` (and optionally `NET_BOOT_MAX_RETRIES`, `DHCP_TIMEOUT`, `TFTP_IP`). The bootloader reads `BOOT_ORDER` from right to left, one hex digit per boot mode (`1` = SD/eMMC, `2` = network). So `0x21` tries SD/eMMC first and only falls back to network if that fails, which is consistent with the `Boot mode: SD (01) order 2` line above; for network first with eMMC fallback use `0x12`. If you see different or missing values, the EEPROM config in use at boot may be different (e.g. old EEPROM, or update not applied on cold boot).

2. **Network tried but failed before any DHCP**
   The bootloader may try network, fail very early (e.g. no link, or timeout before sending DHCP), then fall back to SD without printing a “Trying network” line. Slower link-up (switch, cable) can cause this. Increasing `DHCP_TIMEOUT` and `NET_BOOT_MAX_RETRIES` (and setting `TFTP_IP`) gives the best chance.

3. **CM4 / carrier quirk**
   On some CM4 carriers the bootloader may skip or shorten the network attempt. Serial is the only way to see what it actually does; if you never see any network-related line, treat it as “network not attempted” for that boot.

**What to try:**

- Re-apply EEPROM config with network first and timeouts (as in NETWORK-BOOT-TROUBLESHOOTING), then **full power cycle** (unplug power 10+ s, then power on) with serial connected. Watch from the first character for any “NET”, “DHCP”, “TFTP” or “order” line.
- For a one-off test you can set `BOOT_ORDER=0x2` (network only). This forces a network attempt, so DHCP/TFTP messages should appear on serial even if the log previously went straight to "Boot mode: SD (01) order 2". If network fails, the device won’t boot (no fallback to SD), so use this only to confirm whether the bootloader tries network and what it prints; then set it back to an order with eMMC fallback (`0x12`).
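
Because the bootloader consumes `BOOT_ORDER` one hex digit at a time starting from the rightmost digit (as described in the Raspberry Pi bootloader configuration documentation), it helps to spell a value out before flashing it. The helper below is an illustrative sketch; `decode_boot_order` is a made-up name and only the more common mode digits are labelled. On the device it could be fed from the running config with `decode_boot_order "$(sudo vcgencmd bootloader_config | sed -n 's/^BOOT_ORDER=//p')"`.

```bash
# Decode a BOOT_ORDER value into the sequence the bootloader will try.
# Digits are consumed from the least significant (rightmost) end.
# Usage: decode_boot_order 0xf21
decode_boot_order() {
    local value=${1#0x} out="" digit name
    while [ -n "$value" ]; do
        digit=${value#"${value%?}"}   # last hex digit
        value=${value%?}              # drop it from the remaining order
        case "$digit" in
            1)   name="SD/eMMC" ;;
            2)   name="NETWORK" ;;
            3)   name="RPIBOOT" ;;
            4)   name="USB-MSD" ;;
            6)   name="NVME" ;;
            e|E) name="STOP" ;;
            f|F) name="RESTART" ;;
            *)   name="MODE-$digit" ;;  # not labelled in this sketch
        esac
        out="${out:+$out -> }$name"
    done
    echo "$out"
}

# decode_boot_order 0x21   prints: SD/eMMC -> NETWORK
# decode_boot_order 0x12   prints: NETWORK -> SD/eMMC
```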

---

## TFTP "file .../SERIAL/start4.elf not found": serial-number prefix

The Pi bootloader may request files under a path named after the board serial number (e.g. `0d1ddbda/start4.elf`). If the TFTP root has no such subdirectory, those requests fail and the bootloader falls back to the root (e.g. `start4.elf`). To avoid "not found" for the first requests, on the LXC create the serial directory and symlink the boot files:

```bash
# On the LXC (replace 0d1ddbda with your Pi's serial from vcgencmd or serial output)
mkdir -p /srv/tftpboot/0d1ddbda
cd /srv/tftpboot/0d1ddbda
# Link each boot file that exists in the TFTP root into the serial directory
for f in start4.elf start4cd.elf start.elf fixup4.dat fixup4cd.dat config.txt cmdline.txt kernel8.img initrd.img; do
  [ -f "../$f" ] && ln -sf "../$f" "$f"
done
```

After that, the bootloader’s first TFTP requests succeed. On this LXC the directory has already been created for serial `0d1ddbda`.
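
If the device still boots an OS, the directory name can be derived there instead of being read off the serial log. A sketch, assuming the TFTP prefix is the last 8 hex digits of the `Serial` line in `/proc/cpuinfo` (which matches the `0d1ddbda` example); `tftp_serial_dir` is a hypothetical helper name:

```bash
# Derive the TFTP subdirectory name from a /proc/cpuinfo-style file.
# Assumption: the prefix is the low 32 bits (last 8 hex digits) of the serial.
# Usage: tftp_serial_dir [cpuinfo-file]
tftp_serial_dir() {
    local cpuinfo=${1:-/proc/cpuinfo} serial
    serial=$(awk -F': *' '/^Serial/ { print $2 }' "$cpuinfo")
    [ -n "$serial" ] || return 1          # no Serial line found
    # Keep the last 8 characters, lower-cased
    printf '%s\n' "$serial" | sed 's/.*\(........\)$/\1/' | tr 'A-F' 'a-f'
}
```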

---

## Stuck in network-only boot (BOOT_ORDER=0x2): get back to Raspbian and change boot order

If you set **BOOT_ORDER=0x2** (network only) for testing, the device will never try eMMC. To get back to Raspbian and set **BOOT_ORDER=0x1** (eMMC only) or **0x12** (network first, then eMMC; the digits are read right to left), use **rescue mode**: the network boot chain loads the provisioning initramfs; with a special kernel cmdline it drops to a shell so you can mount eMMC and run **rpi-eeprom-config** from the eMMC install.

### Prerequisites

- **Initramfs with rescue support**: Build the initramfs (it includes `/rescue-eeprom.sh`) and copy it to the LXC TFTP root and into the serial dir:
  ```bash
  cd emmc-provisioning/network-boot-initramfs && ./build.sh
  scp initrd.img root@<LXC-IP>:/srv/tftpboot/
  ssh root@<LXC-IP> 'cp /srv/tftpboot/initrd.img /srv/tftpboot/0d1ddbda/ 2>/dev/null || true'
  ```
- **TFTP config**: Ensure `/srv/tftpboot/config.txt` (and thus `0d1ddbda/config.txt` if it’s a symlink) has `kernel=kernel8.img` and `initramfs initrd.img followkernel` so the full kernel+initrd chain runs.
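
The TFTP config prerequisite can be checked before power-cycling the device. A minimal sketch, assuming the two lines appear exactly as written above with nothing else on the line; `check_netboot_config` is a hypothetical helper and defaults to the TFTP root used in this document:

```bash
# Check that a config.txt selects the kernel and attaches the initramfs.
# Usage: check_netboot_config [config-file]
# Prints what is missing and returns non-zero if either line is absent.
check_netboot_config() {
    local cfg=${1:-/srv/tftpboot/config.txt} rc=0
    grep -Eq '^kernel=kernel8\.img[[:space:]]*$' "$cfg" \
        || { echo "missing: kernel=kernel8.img"; rc=1; }
    grep -Eq '^initramfs[[:space:]]+initrd\.img[[:space:]]+followkernel[[:space:]]*$' "$cfg" \
        || { echo "missing: initramfs initrd.img followkernel"; rc=1; }
    return "$rc"
}
```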

### Steps

1. **On the LXC**, enable rescue for this device by serving a cmdline that includes **provisioning_rescue=1**. The Pi loads `0d1ddbda/cmdline.txt`; replace that with a **real file** (not a symlink) so this device gets the rescue cmdline:

   ```bash
   # On the LXC (replace 0d1ddbda with your Pi serial if different)
   CD="/srv/tftpboot/0d1ddbda"
   rm -f "$CD/cmdline.txt"
   # Same as root cmdline plus rescue flag (one line, space-separated)
   tr '\n' ' ' < /srv/tftpboot/cmdline.txt > "$CD/cmdline.txt"
   printf ' provisioning_rescue=1\n' >> "$CD/cmdline.txt"
   ```

2. **Power on the reTerminal** (or reboot). It will network boot, load kernel + initramfs, and **rescue mode** will start a shell (serial or console). You should see:
   `=== RESCUE MODE (provisioning_rescue=1) ===`

3. **In the rescue shell**, run the helper to mount eMMC and run the EEPROM config from the eMMC install:

   ```bash
   /rescue-eeprom.sh
   ```

   In the editor that opens, set **BOOT_ORDER=0x1** (eMMC only) or **0x12** (network first, then eMMC; the digits are read right to left). Save and exit the editor.

4. **Reboot** from the rescue shell:

   ```bash
   reboot
   ```

   The bootloader will apply the EEPROM update and on the next boot use the new order (eMMC only with 0x1, or network then eMMC with 0x12).

5. **On the LXC**, restore normal cmdline for the device so the next network boot runs the provisioning client, not rescue:

   ```bash
   rm -f /srv/tftpboot/0d1ddbda/cmdline.txt
   ln -s ../cmdline.txt /srv/tftpboot/0d1ddbda/cmdline.txt
   ```
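
Steps 1 and 5 are mirror images of each other, so they can be wrapped in one helper to make it harder to leave a device in rescue mode by accident. This is a sketch only; `set_rescue` is not part of the repository, and the TFTP root default is the path used above.

```bash
# Switch one device's cmdline.txt between rescue and normal mode.
# Usage: set_rescue <on|off> <serial> [tftp-root]
set_rescue() {
    local mode=$1 serial=$2 root=${3:-/srv/tftpboot}
    local target="$root/$serial/cmdline.txt"
    case "$mode" in
        on)
            # Real file: shared cmdline on one line plus the rescue flag
            rm -f "$target"
            printf '%s provisioning_rescue=1\n' \
                "$(tr '\n' ' ' < "$root/cmdline.txt" | sed 's/ *$//')" > "$target"
            ;;
        off)
            # Back to a symlink so the device follows the shared cmdline
            rm -f "$target"
            ln -s ../cmdline.txt "$target"
            ;;
        *)
            echo "usage: set_rescue <on|off> <serial> [tftp-root]" >&2
            return 2
            ;;
    esac
}
```

On the LXC, `set_rescue on 0d1ddbda` would go before power-on and `set_rescue off 0d1ddbda` after the EEPROM has been updated.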

See also **NETWORK-BOOT-LXC.md** for setup and monitoring.