reterminal-dm4/emmc-provisioning/docs/NETWORK-BOOT-TROUBLESHOOTING.md
nearxos ea6f846021 Improve network boot troubleshooting documentation and initramfs scripts
Update NETWORK-BOOT-TROUBLESHOOTING.md to clarify the boot process and emphasize the need to disable PXE before rebooting to ensure EEPROM updates are applied. Enhance initramfs scripts to improve DHCP lease acquisition and capture the device's IP address more reliably. Add a revision tracking feature to the initramfs build process for better version control. Modify provisioning-client.sh to ensure proper reboot handling after deployment and backup actions.
2026-02-21 12:57:26 +02:00

# Network boot troubleshooting: no DHCP/TFTP during boot, only after OS is up
If you run **tcpdump** during power-on but see **no DHCP/TFTP traffic during boot**, and only see traffic **after** the device has booted to the OS, the reTerminal is almost certainly **not on the same L2 segment as the LXC's eth1**.
## What's going on
- The Pi's **bootloader** (EEPROM) sends DHCP Discover on the Ethernet port when it tries network boot.
- That request only reaches interfaces on the **same VLAN / same bridge** (same cable/switch segment).
- dnsmasq in the LXC listens only on **eth1** (provisioning LAN).
- If the reTerminal is plugged into the **main office LAN** (or the same segment as the LXC's **eth0**), the netboot DHCP **never reaches eth1**, so you see no DHCP/TFTP on eth1 during boot.
- After the OS boots, it uses the same Ethernet port and gets an IP from the main LAN; you then see traffic (e.g. on eth0 or from the device's new IP). That's why you only see traffic “after the device boots to OS”.
## What to do
### 1. Confirm which interface sees the boot-time DHCP
On the LXC, run tcpdump on **both** interfaces in two terminals (or run one in background):
```bash
# Terminal 1: provisioning LAN (where netboot should happen)
tcpdump -i eth1 -n -e port 67 or port 68 or port 69
# Terminal 2: WAN / main LAN
tcpdump -i eth0 -n -e port 67 or port 68 or port 69
```
Then **power off** the reTerminal and **power it on**. Watch where DHCP (and TFTP) appear:
- If you see DHCP **only on eth0** during boot → the reTerminal is on the same segment as **eth0**, not eth1. So netboot is not using your LXC's dnsmasq; the device may get an IP from another DHCP server and fall back to eMMC boot.
- If you see DHCP **on eth1** during boot → the reTerminal is on the provisioning segment; you should then see TFTP (port 69) as well.
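Instead of two terminals, both captures can run in the background and write to log files. The following is a minimal sketch, not part of the provisioning scripts: the function names, the log paths under `/tmp`, and the 120-second capture window are assumptions chosen for illustration.

```bash
# Sketch: capture DHCP/TFTP on both interfaces at once, then count DHCP packets.
# Function names, log paths and the 120 s window are illustrative assumptions.
FILTER='port 67 or port 68 or port 69'

capture_both() {
    for ifc in eth0 eth1; do
        # -l line-buffers output so the log is complete when timeout stops tcpdump
        timeout 120 tcpdump -l -i "$ifc" -n -e "$FILTER" > "/tmp/netboot-$ifc.log" 2>&1 &
    done
    wait
}

# Print the number of DHCP packets recorded in a capture log.
count_dhcp() {
    grep -c 'BOOTP/DHCP' "$1"
}
```

Run `capture_both` as root on the LXC, power-cycle the reTerminal within the window, then compare `count_dhcp /tmp/netboot-eth0.log` with `count_dhcp /tmp/netboot-eth1.log`.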
### 2. Fix: put the reTerminal on the same segment as eth1
- The reTerminal's Ethernet cable must be connected to the **provisioning** segment: the same VLAN or bridge as the LXC's **eth1** (e.g. 10.20.50.0/24).
- On Proxmox, eth1 is often on a **dedicated bridge** (e.g. `vmbr1`). The reTerminal must be plugged into a switch port that belongs to that same bridge/VLAN.
- If you have one physical switch: either put the LXC's eth1 and the reTerminal in the same VLAN, or use a dedicated “provisioning” port group / switch.
### 3. Sanity check: same port as reTerminal
- Plug a **laptop** (or another device) into the **same port** (or same VLAN) as the reTerminal.
- Run: `sudo dhclient -v <interface>` (or let it get DHCP automatically).
- If you get an IP in **10.20.50.x** → that segment is your provisioning LAN (eth1); the reTerminal should netboot from there.
- If you get a different range (e.g. 192.168.x.x) → that segment is **not** the provisioning LAN; move the reTerminal's cable or VLAN to the segment where 10.20.50.x is served.
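The address check above can be scripted on the test laptop. This is a small sketch that assumes the example provisioning range `10.20.50.0/24` used in this guide; `PROVISIONING_PREFIX` and `on_provisioning_lan` are hypothetical names.

```bash
# Sketch: does an address belong to the provisioning LAN?
# The prefix is the example range from this guide; adjust it if yours differs.
PROVISIONING_PREFIX='10.20.50.'

on_provisioning_lan() {
    case "$1" in
        "$PROVISIONING_PREFIX"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `on_provisioning_lan 10.20.50.147 && echo "provisioning segment"` prints the message, while an address such as `192.168.1.20` makes the function return non-zero.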
## Summary table
| Symptom | Likely cause | Action |
|--------|---------------|--------|
| No DHCP/TFTP on eth1 during boot; traffic only after OS | reTerminal on different segment than eth1 | Plug reTerminal into same VLAN/bridge as LXC eth1 (provisioning LAN) |
| DHCP on eth0 during boot, none on eth1 | reTerminal on same segment as eth0 | Move reTerminal to provisioning segment (same as eth1) |
| No DHCP on any interface during boot | Cable unplugged, BOOT_ORDER not 0x21, or device not attempting netboot | Check cable, confirm BOOT_ORDER=0x21, power cycle with cable in before power |
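
The table can also be expressed as a small decision helper. This is only a sketch: `diagnose_segment` is a hypothetical name, and its two arguments are assumed to be the number of DHCP packets you counted on eth0 and eth1 during boot (before the OS is up).

```bash
# Sketch: map boot-time DHCP packet counts (eth0, eth1) to the rows of the table.
diagnose_segment() {
    eth0_count=$1
    eth1_count=$2
    if [ "$eth1_count" -gt 0 ]; then
        echo "reTerminal is on the provisioning segment (eth1)"
    elif [ "$eth0_count" -gt 0 ]; then
        echo "reTerminal is on the eth0 segment: move it to the provisioning segment"
    else
        echo "no boot-time DHCP seen: check cable and BOOT_ORDER, power-cycle with cable connected"
    fi
}
```

Calling `diagnose_segment 4 0` reports that the device is on the eth0 segment.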
---
## I only see DHCP Request/Reply, and the client already has 10.20.50.x
If your tcpdump on **eth1** shows something like:
```text
0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 88:a2:9e:xx:xx:xx
10.20.50.1.67 > 10.20.50.147.68: BOOTP/DHCP, Reply
```
that is **not** the bootloader — it is the **OS** DHCP client (renewal or re-request). The client already has **10.20.50.147**, so this happens **after** the device has booted to the OS.
- **Bootloader** (network boot): sends **DHCP Discover** (client 0.0.0.0, no IP yet), then you see **Offer**, **Request**, **Ack**, then **TFTP (port 69)** for start4cd.elf, kernel, etc.
- **OS**: sends **DHCP Request** (renew/rebind, often already with an IP or requesting a known one), then **Reply** — no Discover, no TFTP.
So the device **is** on the right segment (eth1, 10.20.50.x). The problem is that you are not seeing the **bootloader's** DHCP/TFTP during the first seconds after power-on.
**What to do:**
1. **Start tcpdump before power-on**
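Run `tcpdump -i eth1 -n -e port 67 or port 68 or port 69` on the LXC, **then** power off the reTerminal, wait a few seconds, and power it on. Capture from the first second. Add `-v` to print the DHCP message type; without it, tcpdump labels every client packet `Request` (the BOOTP op code), including a Discover. Look for:
- **Discover** (client 0.0.0.0 → broadcast) at the very start → that's the bootloader.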
- **TFTP (port 69)** right after DHCP Ack → bootloader loading files.
2. If you **never** see Discover or TFTP, only Request/Reply after the OS is up, then the bootloader is either not attempting network boot or is giving up (e.g. link not ready, timeout) and booting from eMMC. Try a full power-off (mains or PSU), wait 10 s, then power on with tcpdump already running.
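3. Confirm that `BOOT_ORDER` on the device includes network boot (digit `2`) and that Ethernet is connected before power-on. The digits are tried right to left: **0x21** tries SD/eMMC first and network only if that fails, while **0x12** tries network first.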
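
To tell a bootloader Discover from an OS renewal, the DHCP message type has to be read from verbose tcpdump output. The sketch below assumes that `tcpdump -v` prints a `DHCP-Message` option line ending in the type name (the exact wording can differ between tcpdump versions); `dhcp_types` and `saw_discover` are hypothetical helper names.

```bash
# Print each DHCP message type found in `tcpdump -v` text on stdin, one per line.
# Assumes a line such as "DHCP-Message ... length 1: Discover" per packet.
dhcp_types() {
    sed -n 's/.*DHCP-Message.*: \([A-Za-z]*\).*/\1/p'
}

# Succeed if the capture on stdin contains at least one Discover.
saw_discover() {
    dhcp_types | grep -qx 'Discover'
}
```

For a live capture: `tcpdump -l -i eth1 -n -v 'port 67 or port 68' | dhcp_types`. A sequence starting with `Discover` right after power-on points at the bootloader; a lone `Request` later on points at the OS.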
---
## reTerminal DM: serial console vs USB boot (rpiboot)
**The serial console is not on the same USB as rpiboot.**
| Port / interface | Purpose |
|------------------|--------|
| **USB Type-C** (next to boot-mode switch) | Power, and **rpiboot** when eMMC is disabled (USB device mode). No serial console here. |
| **40-pin GPIO header** (UART) | **Serial console.** Use a USB-to-serial adapter; connect its **RX** to **GPIO 14 (TXD, Pin 8)**, its **TX** to **GPIO 15 (RXD, Pin 10)**, and **GND** to any GND pin. |
**Baud rate:**
- **Bootloader (BOOT_UART=1):** use **115200** 8N1. This is the Pi EEPROM/bootloader debug output (network boot attempts, DHCP, TFTP, errors).
- **OS serial login:** some Seeed docs use **9600** for getty; many Pi images use **115200**. If you only care about bootloader messages, use **115200**.
So: use the **USB-C cable** only for power and rpiboot. For serial console, use a **USB-to-serial adapter** on the **GPIO header** at **115200** to see bootloader output.
---
## Serial shows "Boot mode: SD (01)" and no network attempt
If the bootloader serial output shows something like:
```text
Boot mode: SD (01) order 2
```
and you **never** see a line about network (e.g. "Trying DHCP", "TFTP", or "Boot mode: NET (02)"), then the bootloader is **not** attempting network boot for this boot. It goes straight to SD/eMMC (01). That matches “no DHCP during boot, only after OS”.
**Possible causes:**
1. **BOOT_ORDER not applied or not read**
From the running OS, confirm:
`sudo vcgencmd bootloader_config`
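and check the `BOOT_ORDER` value (and optionally `NET_BOOT_MAX_RETRIES`, `DHCP_TIMEOUT`, `TFTP_IP`). The digits are tried right to left, so `BOOT_ORDER=0x21` tries SD/eMMC (1) first and network (2) only if that fails; with a bootable OS on eMMC the bootloader never reaches network, which is what "Boot mode: SD (01) order 2" shows. For network first with eMMC fallback, use `0x12`. If you see different or missing values, the EEPROM config in use at boot may be different (e.g. old EEPROM, or update not applied on cold boot).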
2. **Network tried but failed before any DHCP**
The bootloader may try network, fail very early (e.g. no link, or timeout before sending DHCP), then fall back to SD without printing a “Trying network” line. Slower link-up (switch, cable) can cause this. Increasing `DHCP_TIMEOUT` and `NET_BOOT_MAX_RETRIES` (and setting `TFTP_IP`) gives the best chance.
3. **CM4 / carrier quirk**
On some CM4 carriers the bootloader may skip or shorten the network attempt. Serial is the only way to see what it actually does; if you never see any network-related line, treat it as “network not attempted” for that boot.
**What to try:**
- Re-apply EEPROM config with network first and timeouts (as in NETWORK-BOOT-TROUBLESHOOTING), then **full power cycle** (unplug power 10+ s, then power on) with serial connected. Watch from the first character for any “NET”, “DHCP”, “TFTP” or “order” line.
- For a one-off test you can set `BOOT_ORDER=0x2` (network only). This forces a network attempt and should produce DHCP/TFTP messages on serial even when the normal log never shows "NET", "DHCP", or "TFTP" and goes straight to "Boot mode: SD (01) order 2". If network fails, the device won't boot (no fallback to SD), so use this only to confirm whether the bootloader tries network and what it prints; then set it back to `0x21`.
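
When checking `BOOT_ORDER`, it helps to spell out the order the bootloader will actually follow. The sketch below relies on the Raspberry Pi bootloader convention that the value is read from the least-significant digit upwards; only the digits `1`, `2` and `f` are named, other digits are printed as `mode-N`. Both function names are hypothetical.

```bash
# Extract the BOOT_ORDER value from `vcgencmd bootloader_config` output on stdin.
boot_order_value() {
    sed -n 's/^BOOT_ORDER=\(0x[0-9A-Fa-f]*\).*/\1/p' | head -n 1
}

# Print the boot modes in the order the bootloader tries them (right to left).
decode_boot_order() {
    hex=${1#0x}
    out=""
    while [ -n "$hex" ]; do
        digit=${hex#"${hex%?}"}   # last (least-significant) digit
        hex=${hex%?}              # drop it and continue leftwards
        case "$digit" in
            1) out="$out sd-emmc" ;;
            2) out="$out network" ;;
            f|F) out="$out restart" ;;
            *) out="$out mode-$digit" ;;
        esac
    done
    echo "${out# }"
}
```

On the device: `decode_boot_order "$(sudo vcgencmd bootloader_config | boot_order_value)"`. If `sd-emmc` is listed before `network` and the eMMC holds a bootable OS, the bootloader has no reason to attempt network boot.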
---
## Boot stops after start4.elf ("PCI0 reset" then nothing)
### What's actually going on
The **EEPROM bootloader** only does TFTP for config.txt, start4.elf, and fixup4.dat. It then **starts the GPU firmware (start4.elf)** and **stops the network**. The **kernel and initrd are loaded by the GPU firmware**, not by the EEPROM: after “Starting start4.elf”, the GPU is supposed to bring the network back up and TFTP kernel8.img, cmdline.txt, and initrd.img. If you never see TFTP for kernel8.img/initrd.img and the log stops at “PCI0 reset”, the GPU stage is not doing that. Common causes:
1. **Config not seen by the GPU** — The config the EEPROM fetched (e.g. from `0d1ddbda/config.txt`) must contain `kernel=kernel8.img` and `initramfs initrd.img followkernel`. If that file was a symlink or truncated, the GPU may not see those lines. Use a **real copy** of the full config in the serial dir (see ensure script below).
2. **No visibility into the GPU** — The EEPROM logs stop at “PCI0 reset”; the next step is inside the GPU firmware. To see GPU messages (e.g. network bring-up, TFTP, or errors), add **`uart_2ndstage=1`** to config.txt so the GPU logs to the UART. Then power-cycle and watch for lines like `MESS:... genet: LINK STATUS` or TFTP activity.
3. **Firmware/board quirk** — On some boards or firmware versions the GPU netboot path can fail silently. Ensuring the latest Pi 4/CM4 boot files in the TFTP root and trying **start4cd.elf** + **fixup4cd.dat** (or leaving defaults) is worth a try.
If the serial log shows **TFTP** for config.txt, start4.elf, fixup4.dat, then **"Starting start4.elf"**, **"Stopping network"**, **"PCI0 reset"**, and **no** TFTP requests for **kernel8.img** or **initrd.img**, use the checks below.
**Fix on the LXC:** ensure `/srv/tftpboot/config.txt` contains (and that `0d1ddbda/config.txt` is a real copy with the same content):
```ini
enable_uart=1
kernel=kernel8.img
initramfs initrd.img followkernel
uart_2ndstage=1
```
`enable_uart=1` is required for the kernel serial console when netbooting (otherwise the firmware can set 8250.nr_uarts=0). `uart_2ndstage=1` makes the GPU firmware log to the UART so you see **MESS:** lines after "PCI0 reset" (e.g. network bring-up, TFTP, or errors).
You can run:
```bash
# On the LXC (or from your machine)
ssh root@<LXC-IP> 'bash -s' < emmc-provisioning/scripts/ensure-tftpboot-config-kernel-initrd.sh
```
Also ensure the TFTP root has **kernel8.img** and **initrd.img** (and the serial subdir has symlinks or copies). Then power-cycle the device; you should see TFTP_GET for kernel8.img and initrd.img, then the kernel and initramfs (e.g. rescue shell or provisioning client) run.
**If it still stops after “PCI0 reset”:**
- Add **`uart_2ndstage=1`** to the TFTP config.txt (root and serial copy). Re-run the ensure script so the serial dir gets the updated config, then power-cycle. Watch the serial log for **MESS:** lines from the GPU (e.g. `genet: LINK STATUS`, TFTP, or errors). That shows whether the GPU is bringing the network up and trying to load the kernel.
- On the LXC, confirm the config the device gets has the right size and content:
`ssh root@<LXC-IP> 'wc -c /srv/tftpboot/0d1ddbda/config.txt && grep -E "kernel|initramfs|uart_2ndstage" /srv/tftpboot/0d1ddbda/config.txt'`
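
The content check can be made stricter than a `grep` by reporting exactly which lines are missing and whether the per-device config is a real copy. This is a sketch under the assumption that the four settings listed in this section are the ones required; `missing_netboot_lines` and `is_real_copy` are hypothetical names, and lines with trailing spaces or CRLF endings are reported as missing.

```bash
# Print each required netboot line that is missing from a config.txt.
# Matching is exact per line, so trailing whitespace counts as a mismatch.
missing_netboot_lines() {
    cfg=$1
    for line in 'enable_uart=1' 'kernel=kernel8.img' \
                'initramfs initrd.img followkernel' 'uart_2ndstage=1'; do
        grep -qxF "$line" "$cfg" || echo "$line"
    done
}

# Succeed only if the path is a regular file and not a symlink.
is_real_copy() {
    [ -f "$1" ] && [ ! -L "$1" ]
}
```

On the LXC, `missing_netboot_lines /srv/tftpboot/0d1ddbda/config.txt` prints nothing when all four lines are present.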
---
## Kernel loads but serial stops at "Baud rate change done" (no rescue shell)
If you see the GPU load kernel8.img and initrd.img, then **"Baud rate change done..."** and nothing else (no rescue shell, no kernel messages), the kernel is likely hanging very early because of a **missing or invalid Device Tree**. The GPU log may show **`dterror: Failed to load Device Tree file '?'`**.
The GPU loads files from the **serial-prefix** dir (e.g. `0d1ddbda/`). If the **.dtb** files (e.g. `bcm2711-rpi-cm4.dtb`, `bcm2711-rpi-cm4-io.dtb`) are only in the TFTP root and not in that dir, the firmware can fail to load the right DTB and the kernel gets no valid device tree.
**Fix:** Ensure the TFTP root has the Pi 4/CM4 DTB files (from the [Raspberry Pi firmware](https://github.com/raspberrypi/firmware) `boot/` folder) and that each **serial-prefix** dir has symlinks to them. Re-run the ensure script (it now links `*.dtb` into each serial dir):
```bash
ssh root@<LXC-IP> 'bash -s' < emmc-provisioning/scripts/ensure-tftpboot-config-kernel-initrd.sh
```
If the TFTP root has no `*.dtb` files, populate it from the Pi firmware (e.g. run `populate-tftpboot-from-git.sh` or copy `bcm2711-rpi-cm4.dtb`, `bcm2711-rpi-cm4-io.dtb`, and other `bcm2711*.dtb` from the firmware repo into `/srv/tftpboot`), then run the ensure script again and power-cycle the device.
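
To see which DTB files a serial-prefix dir is still missing, the two directories can be compared. This is a minimal sketch; `missing_dtbs` is a hypothetical helper, and a dangling symlink in the serial dir is treated as missing.

```bash
# List *.dtb files present in the TFTP root but not reachable in a serial-prefix dir.
missing_dtbs() {
    root=$1
    serial_dir=$2
    for dtb in "$root"/*.dtb; do
        [ -e "$dtb" ] || continue          # no *.dtb in the root at all
        name=${dtb##*/}
        [ -e "$serial_dir/$name" ] || echo "$name"
    done
}
```

For example, `missing_dtbs /srv/tftpboot /srv/tftpboot/0d1ddbda` prints one file name per missing DTB and nothing when the serial dir is complete.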
**Serial stops at "Baud rate change done" (no kernel/initramfs output):** On Pi 4/CM4 netboot, the firmware can force **8250.nr_uarts=0**, which disables the kernel serial driver so you get no console after the GPU handoff ([raspberrypi/firmware#1575](https://github.com/raspberrypi/firmware/issues/1575)). The workaround is **`enable_uart=1`** in config.txt (within the first 4KB). The ensure script adds it; re-run the script so the root and serial-prefix configs have it, then power-cycle. Keep serial at **115200** baud.
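
Whether `enable_uart=1` really sits within the first 4 KB can be checked directly. A small sketch, assuming the setting is written on a line of its own; `enable_uart_in_first_4k` is a hypothetical name.

```bash
# Succeed if "enable_uart=1" appears as a full line within the first 4096 bytes.
enable_uart_in_first_4k() {
    head -c 4096 "$1" | grep -qx 'enable_uart=1'
}
```

Example: `enable_uart_in_first_4k /srv/tftpboot/0d1ddbda/config.txt || echo "move enable_uart=1 nearer the top"`.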
---
## TFTP "file .../SERIAL/start4.elf not found" — serial-number prefix
The Pi bootloader may request files under a path named after the board serial number (e.g. `0d1ddbda/start4.elf`). If the TFTP root has no such subdirectory, those requests fail and the bootloader falls back to the root (e.g. `start4.elf`). To avoid "not found" for the first requests, on the LXC create the serial directory and symlink the boot files:
```bash
# On the LXC (replace 0d1ddbda with your Pi's serial from vcgencmd or serial output)
mkdir -p /srv/tftpboot/0d1ddbda
cd /srv/tftpboot/0d1ddbda
for f in start4.elf start4cd.elf start.elf fixup4.dat fixup4cd.dat config.txt cmdline.txt kernel8.img initrd.img; do
    # Link only the files that actually exist in the TFTP root
    [ -f "../$f" ] && ln -sf "../$f" "$f"
done
```
After that, the bootloader's first TFTP requests succeed. In this setup the directory for serial `0d1ddbda` already exists.
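
Two small helpers can support this step. The sketch assumes that the directory name is the last 8 hex digits of the board serial, as the example `0d1ddbda` suggests; `tftp_prefix` and `dangling_links` are hypothetical names.

```bash
# Derive the TFTP directory name from a full serial string
# (assumed: last 8 hex digits, lowercased).
tftp_prefix() {
    printf '%s\n' "$1" | tr 'A-F' 'a-f' | sed 's/.*\(........\)$/\1/'
}

# List symlinks in a directory whose targets do not exist.
dangling_links() {
    for link in "$1"/*; do
        [ -L "$link" ] && [ ! -e "$link" ] && echo "${link##*/}"
    done
    return 0
}
```

On the Pi, `tftp_prefix "$(awk '/^Serial/ {print $3}' /proc/cpuinfo)"` prints the directory name; on the LXC, `dangling_links /srv/tftpboot/0d1ddbda` lists links that point at files missing from the TFTP root.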
---
## Stuck in network-only boot (BOOT_ORDER=0x2): get back to Raspbian and change boot order
If you set **BOOT_ORDER=0x2** (network only) for testing, the device will never try eMMC. To get back to Raspbian and set **BOOT_ORDER=0x1** or **0x21**, use **rescue mode**: the network boot chain loads the provisioning initramfs; with a special kernel cmdline it drops to a shell so you can mount eMMC and run **rpi-eeprom-config** from the eMMC install.
### Prerequisites
- **Initramfs with rescue support** — Build the initramfs (it includes `/rescue-eeprom.sh`) and copy it to the LXC TFTP root and into the serial dir:
```bash
cd emmc-provisioning/network-boot-initramfs && ./build.sh
scp initrd.img root@<LXC-IP>:/srv/tftpboot/
ssh root@<LXC-IP> 'cp /srv/tftpboot/initrd.img /srv/tftpboot/0d1ddbda/ 2>/dev/null || true'
```
- **TFTP config** — Ensure `/srv/tftpboot/config.txt` (and thus `0d1ddbda/config.txt` if it's a symlink) has `kernel=kernel8.img` and `initramfs initrd.img followkernel` so the full kernel+initrd chain runs.
### Steps
1. **On the LXC**, enable rescue for this device by serving a cmdline that includes **provisioning_rescue=1**. The Pi loads `0d1ddbda/cmdline.txt`; replace that with a **real file** (not a symlink) so this device gets the rescue cmdline:
```bash
# On the LXC (replace 0d1ddbda with your Pi serial if different)
CD="/srv/tftpboot/0d1ddbda"
rm -f "$CD/cmdline.txt"
# Same as root cmdline plus rescue flag (one line, space-separated)
tr '\n' ' ' < /srv/tftpboot/cmdline.txt > "$CD/cmdline.txt"
printf ' provisioning_rescue=1\n' >> "$CD/cmdline.txt"
```
2. **Power on the reTerminal** (or reboot). It will network boot, load kernel + initramfs, and **rescue mode** will start a shell (serial or console). You should see:
`=== RESCUE MODE (provisioning_rescue=1) ===`
3. **In the rescue shell**, run the helper to mount eMMC and run the EEPROM config from the eMMC install:
```bash
/rescue-eeprom.sh
```
In the editor that opens, set **BOOT_ORDER=0x1** (eMMC only), **0x21** (eMMC first, then network) or **0x12** (network first, then eMMC); the digits are tried right to left. Save and exit the editor.
4. **Disable PXE before rebooting.** The EEPROM update is applied only when the bootloader **boots from the same storage** where the update file was written. You wrote it to **eMMC**, so the bootloader must **boot from eMMC** once to apply it. With **BOOT_ORDER=0x2** (network only) the next reboot netboots again, so the bootloader never reads eMMC and the update is never applied. **On the LXC**, disable PXE so the next boot does not advertise TFTP:
`ssh root@<LXC-IP> '/opt/cm4-provisioning/toggle-network-boot-dhcp.sh disable'`
5. **Reboot** from the rescue shell, or power cycle the reTerminal:
```bash
reboot
```
Alternatives in the rescue shell are `reboot -f` or `echo b > /proc/sysrq-trigger`. The bootloader will get DHCP without option 66/67; it may then try eMMC (depending on firmware) and apply the update. If it still netboots (e.g. cached TFTP), unplug the Ethernet cable and power cycle so it has no choice but eMMC. Once the update is applied, the next boot uses the new order.
6. **After you are back in Raspbian**, restore normal cmdline for the device so the next network boot runs the provisioning client, not rescue:
```bash
./emmc-provisioning/scripts/disable-rescue-cmdline-on-lxc.sh root@<LXC-IP> 0d1ddbda
```
Or on the LXC: `rm -f /srv/tftpboot/0d1ddbda/cmdline.txt && ln -s ../cmdline.txt /srv/tftpboot/0d1ddbda/cmdline.txt`
**Why did my boot order not change?** The update file was written to the **eMMC** boot partition. The bootloader applies it only when it **boots from that partition**. When you rebooted, the device netbooted again (TFTP), so the bootloader read the “boot” files from the network, not from eMMC, and never saw or applied the update. Disable PXE (and optionally unplug Ethernet) before rebooting so the next boot is from eMMC and the update is applied.
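
Before powering on for the rescue boot, the per-device cmdline written in step 1 can be verified. This is a sketch only; `rescue_cmdline_ok` is a hypothetical helper that checks the three properties the steps rely on: a real file, a single line, and the `provisioning_rescue=1` flag.

```bash
# Succeed if a cmdline.txt is a real one-line file containing the rescue flag.
rescue_cmdline_ok() {
    f=$1
    [ -f "$f" ] && [ ! -L "$f" ] || return 1      # real file, not a symlink
    [ $(wc -l < "$f") -le 1 ] || return 1         # kernel cmdline must be one line
    grep -qw 'provisioning_rescue=1' "$f"
}
```

On the LXC: `rescue_cmdline_ok /srv/tftpboot/0d1ddbda/cmdline.txt && echo "rescue cmdline ready"`. After restoring the symlink in step 6, the same check fails, which confirms the device is back on the normal cmdline.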
See also **NETWORK-BOOT-LXC.md** for setup and monitoring.