Enhance network boot troubleshooting documentation and scripts
Update NETWORK-BOOT-TROUBLESHOOTING.md to clarify the boot process after start4.elf, emphasizing the importance of config.txt settings for kernel and initramfs. Introduce checks for GPU logging and ensure proper configuration for UART. Modify initramfs scripts to improve DHCP lease acquisition and ensure shell output is directed to the serial console. Update ensure-tftpboot-config-kernel-initrd.sh to enforce necessary config settings and link DTB files in serial-prefix directories for better device compatibility.
@@ -129,15 +129,27 @@ and you **never** see a line about network (e.g. "Trying DHCP", "TFTP", or "Boot

## Boot stops after start4.elf ("PCI0 reset" then nothing)

If the serial log shows **TFTP** for config.txt, start4.elf, fixup4.dat, then **"Starting start4.elf"**, **"Stopping network"**, **"PCI0 reset"**, and **no** TFTP requests for **kernel8.img** or **initrd.img**, the bootloader is not loading the kernel. That usually means **config.txt** in the TFTP root does not have the **kernel** and **initramfs** lines.
### What’s actually going on

**Fix on the LXC:** ensure `/srv/tftpboot/config.txt` contains (and that `0d1ddbda/config.txt` is a symlink to it or has the same content):
The **EEPROM bootloader** only does TFTP for config.txt, start4.elf, and fixup4.dat. It then **starts the GPU firmware (start4.elf)** and **stops the network**. The **kernel and initrd are loaded by the GPU firmware**, not by the EEPROM: after “Starting start4.elf”, the GPU is supposed to bring the network back up and TFTP kernel8.img, cmdline.txt, and initrd.img. If you never see TFTP for kernel8.img/initrd.img and the log stops at “PCI0 reset”, the GPU stage is not doing that. Common causes:

1. **Config not seen by the GPU** — The config the EEPROM fetched (e.g. from `0d1ddbda/config.txt`) must contain `kernel=kernel8.img` and `initramfs initrd.img followkernel`. If that file was a symlink or truncated, the GPU may not see those lines. Use a **real copy** of the full config in the serial dir (see ensure script below).
2. **No visibility into the GPU** — The EEPROM logs stop at “PCI0 reset”; the next step is inside the GPU firmware. To see GPU messages (e.g. network bring-up, TFTP, or errors), add **`uart_2ndstage=1`** to config.txt so the GPU logs to the UART. Then power-cycle and watch for lines like `MESS:... genet: LINK STATUS` or TFTP activity.
3. **Firmware/board quirk** — On some boards or firmware versions the GPU netboot path can fail silently. Ensuring the latest Pi 4/CM4 boot files in the TFTP root and trying **start4cd.elf** + **fixup4cd.dat** (or leaving defaults) is worth a try.
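
Cause 1 can be checked mechanically on the TFTP server. The following is a minimal sketch, not part of the repository's scripts: the function name `check_netboot_config` is hypothetical, and the anchored patterns are an assumption about how strict the check should be (they deliberately ignore commented-out lines such as `#kernel=...`).

```bash
#!/usr/bin/env bash
# Hypothetical helper (not one of the repository's scripts): verify that a
# config.txt is a regular file and carries the kernel and initramfs lines.
check_netboot_config() {
  local cfg="$1" status=0
  if [ -L "$cfg" ]; then
    echo "WARN: $cfg is a symlink; use a real copy"
    status=1
  fi
  if [ ! -s "$cfg" ]; then
    echo "FAIL: $cfg is missing or empty"
    return 1
  fi
  # Anchored patterns so that commented-out lines do not count as present.
  grep -E '^kernel=kernel8\.img[[:space:]]*$' "$cfg" >/dev/null \
    || { echo "FAIL: no kernel=kernel8.img line in $cfg"; status=1; }
  grep -E '^initramfs[[:space:]]+initrd\.img[[:space:]]+followkernel' "$cfg" >/dev/null \
    || { echo "FAIL: no initramfs initrd.img followkernel line in $cfg"; status=1; }
  [ "$status" -eq 0 ] && echo "OK: $cfg"
  return "$status"
}
```

For example, `check_netboot_config /srv/tftpboot/0d1ddbda/config.txt` returns non-zero when the serial-dir config is a symlink, is empty, or lacks either line.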

If the serial log shows **TFTP** for config.txt, start4.elf, fixup4.dat, then **"Starting start4.elf"**, **"Stopping network"**, **"PCI0 reset"**, and **no** TFTP requests for **kernel8.img** or **initrd.img**, use the checks below.

**Fix on the LXC:** ensure `/srv/tftpboot/config.txt` contains (and that `0d1ddbda/config.txt` is a real copy with the same content):

```ini
enable_uart=1
kernel=kernel8.img
initramfs initrd.img followkernel
uart_2ndstage=1
```

`enable_uart=1` is required for the kernel serial console when netbooting (otherwise the firmware can set 8250.nr_uarts=0). `uart_2ndstage=1` makes the GPU firmware log to the UART so you see **MESS:** lines after "PCI0 reset" (e.g. network bring-up, TFTP, or errors).
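
The ensure script's comments add that `enable_uart=1` has to fall within the first 4KB of config.txt. A rough way to test that is sketched here; the function name `uart_enabled_early` is hypothetical, and treating "first 4KB" as the first 4096 bytes is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only if an uncommented enable_uart=1 line
# appears within the first 4096 bytes of the given config file.
uart_enabled_early() {
  local cfg="$1"
  [ -f "$cfg" ] || return 1
  head -c 4096 "$cfg" | grep -E '^enable_uart=1[[:space:]]*$' >/dev/null
}
```

Usage: `uart_enabled_early /srv/tftpboot/config.txt || echo "enable_uart=1 missing or too late in the file"`.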

You can run:

```bash
@@ -147,6 +159,30 @@ ssh root@<LXC-IP> 'bash -s' < emmc-provisioning/scripts/ensure-tftpboot-config-k

Also ensure the TFTP root has **kernel8.img** and **initrd.img** (and the serial subdir has symlinks or copies). Then power-cycle the device; you should see TFTP_GET for kernel8.img and initrd.img, then the kernel and initramfs (e.g. rescue shell or provisioning client) run.
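
A quick presence check for those two files might look like the sketch below. The function name `missing_boot_files` is hypothetical; it is meant to be run once against the TFTP root and once against the serial subdir.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report boot files that are missing or empty in a
# directory. "-s" follows symlinks, so a dangling link counts as missing.
missing_boot_files() {
  local dir="$1" f rc=0
  for f in kernel8.img initrd.img; do
    if [ ! -s "$dir/$f" ]; then
      echo "missing or empty: $dir/$f"
      rc=1
    fi
  done
  return "$rc"
}
```

Usage: `missing_boot_files /srv/tftpboot; missing_boot_files /srv/tftpboot/0d1ddbda`.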

**If it still stops after “PCI0 reset”:**

- Add **`uart_2ndstage=1`** to the TFTP config.txt (root and serial copy). Re-run the ensure script so the serial dir gets the updated config, then power-cycle. Watch the serial log for **MESS:** lines from the GPU (e.g. `genet: LINK STATUS`, TFTP, or errors). That shows whether the GPU is bringing the network up and trying to load the kernel.
- On the LXC, confirm the config the device gets has the right size and content:
  `ssh root@<LXC-IP> 'wc -c /srv/tftpboot/0d1ddbda/config.txt && grep -E "kernel|initramfs|uart_2ndstage" /srv/tftpboot/0d1ddbda/config.txt'`
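
If the serial output is captured to a file, the GPU-stage lines can be pulled out for review. This is only a sketch: it assumes your terminal program writes a log file, and the function name `gpu_stage_lines` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the MESS: lines that appear at or after the
# first "PCI0 reset" line in a captured serial log. Returns non-zero when
# there are none, i.e. the GPU firmware logged nothing to the UART.
gpu_stage_lines() {
  local log="$1"
  sed -n '/PCI0 reset/,$p' "$log" | grep 'MESS:'
}
```

No output from `gpu_stage_lines serial.log` suggests that `uart_2ndstage=1` did not reach the device, or that the GPU stage never started.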

---

## Kernel loads but serial stops at "Baud rate change done" (no rescue shell)

If you see the GPU load kernel8.img and initrd.img, then **"Baud rate change done..."** and nothing else (no rescue shell, no kernel messages), the kernel is likely hanging very early because of a **missing or invalid Device Tree**. The GPU log may show **`dterror: Failed to load Device Tree file '?'`**.

The GPU loads files from the **serial-prefix** dir (e.g. `0d1ddbda/`). If the **.dtb** files (e.g. `bcm2711-rpi-cm4.dtb`, `bcm2711-rpi-cm4-io.dtb`) are only in the TFTP root and not in that dir, the firmware can fail to load the right DTB and the kernel gets no valid device tree.

**Fix:** Ensure the TFTP root has the Pi 4/CM4 DTB files (from the [Raspberry Pi firmware](https://github.com/raspberrypi/firmware) `boot/` folder) and that each **serial-prefix** dir has symlinks to them. Re-run the ensure script (it now links `*.dtb` into each serial dir):

```bash
ssh root@<LXC-IP> 'bash -s' < emmc-provisioning/scripts/ensure-tftpboot-config-kernel-initrd.sh
```

If the TFTP root has no `*.dtb` files, populate it from the Pi firmware (e.g. run `populate-tftpboot-from-git.sh` or copy `bcm2711-rpi-cm4.dtb`, `bcm2711-rpi-cm4-io.dtb`, and other `bcm2711*.dtb` from the firmware repo into `/srv/tftpboot`), then run the ensure script again and power-cycle the device.
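
To see which DTBs a serial-prefix dir still lacks before re-running the ensure script, a read-only check along these lines may help. It is a sketch that mirrors the script's linking loop; the function name `missing_dtbs` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list *.dtb files present in the TFTP root that have
# no counterpart (copy or symlink) in the given serial-prefix directory.
missing_dtbs() {
  local root="$1" serial_dir="$2" dtb base rc=0
  for dtb in "$root"/*.dtb; do
    [ -f "$dtb" ] || continue   # glob matched nothing: root has no DTBs
    base=$(basename "$dtb")
    if [ ! -e "$serial_dir/$base" ]; then
      echo "missing: $serial_dir/$base"
      rc=1
    fi
  done
  return "$rc"
}
```

Usage: `missing_dtbs /srv/tftpboot /srv/tftpboot/0d1ddbda`. It returns 0 with no output when every root DTB is reachable from the serial dir; note that it also returns 0 when the root has no DTBs at all, which is the case the paragraph above covers.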

**Serial stops at "Baud rate change done" (no kernel/initramfs output):** On Pi 4/CM4 netboot, the firmware can force **8250.nr_uarts=0**, which disables the kernel serial driver so you get no console after the GPU handoff ([raspberrypi/firmware#1575](https://github.com/raspberrypi/firmware/issues/1575)). The workaround is **`enable_uart=1`** in config.txt (within the first 4KB). The ensure script adds it; re-run the script so the root and serial-prefix configs have it, then power-cycle. Keep serial at **115200** baud.

---

## TFTP "file .../SERIAL/start4.elf not found" — serial-number prefix

@@ -144,13 +144,18 @@ if [ ! -f "$BUILD_DIR/bin/busybox" ] || [ ! -s "$BUILD_DIR/bin/busybox" ]; then
fi
chmod +x "$BUILD_DIR/bin/busybox" 2>/dev/null || true

# Busybox applets we need (sh, mount, udhcpc, etc.)
# Busybox applet symlinks (mount, mkdir, etc.). When building arm64 on x86, busybox cannot be run so --list fails; create symlinks manually.
APPLETS="sh ash mount umount mkdir cat ip udhcpc sleep echo grep cut awk hostname dd reboot chroot ls rm"
cd "$BUILD_DIR/bin"
./busybox --list 2>/dev/null | while read applet; do
  case "$applet" in
    sh|ash|mount|umount|mkdir|cat|ip|udhcpc|sleep|echo|grep|cut|awk|hostname|dd|reboot) ln -sf busybox "$applet"; ;;
  esac
done
if ./busybox --list >/dev/null 2>&1; then
  ./busybox --list | while read applet; do
    case " $APPLETS " in *" $applet "*) ln -sf busybox "$applet"; ;; esac
  done
else
  for applet in $APPLETS; do
    [ -e "$applet" ] || ln -sf busybox "$applet"
  done
fi
[ -e sh ] || ln -sf busybox sh

# Build cpio (gzip)

@@ -15,10 +15,11 @@ mount -t devtmpfs none /dev
mkdir -p /dev/pts
mount -t devpts none /dev/pts

# Kernel might have brought up eth0 via ip=dhcp; ensure we have an IP
# Kernel might have brought up eth0 via ip=dhcp; ensure we have an IP (run in background with timeout so we don't block rescue shell)
if ! ip addr show | grep -q 'inet .* scope global'; then
  echo "Getting DHCP lease..."
  udhcpc -f -q -i eth0 -n 2>/dev/null || true
  ( udhcpc -f -q -i eth0 -n -T 5 2>/dev/null || true ) &
  sleep 6
fi

# Allow kernel cmdline to override: provisioning_server=... and rescue mode
@@ -35,7 +36,8 @@ export PROVISIONING_SERVER
if [ "$RESCUE" -eq 1 ]; then
  echo "=== RESCUE MODE (provisioning_rescue=1) ==="
  echo "Run /rescue-eeprom.sh to mount eMMC and change boot order (rpi-eeprom-config), then reboot."
  echo "Or run /bin/sh for a shell."
  # Ensure shell I/O goes to serial console (some setups drop output otherwise)
  [ -c /dev/console ] && exec </dev/console >/dev/console 2>&1
  exec /bin/sh -i
fi

Binary file not shown.
24
emmc-provisioning/scripts/check-dhcp-network-boot-on-lxc.sh
Executable file
@@ -0,0 +1,24 @@
#!/usr/bin/env bash
# Check whether DHCP network-boot options (66/67) are enabled on the LXC.
# Usage: ./check-dhcp-network-boot-on-lxc.sh [LXC_HOST]
# Example: ./check-dhcp-network-boot-on-lxc.sh root@10.20.30.153

LXC="${1:-root@10.20.30.153}"
PXE_CONF="/etc/dnsmasq.d/network-boot-pxe.conf"

echo "Checking DHCP network-boot status on $LXC ..."
ssh "$LXC" "bash -s" << 'REMOTE'
PXE_CONF="/etc/dnsmasq.d/network-boot-pxe.conf"
if [ -f "$PXE_CONF" ]; then
  echo "Status: ENABLED (option 66/67 are advertised - devices will try network boot)"
  echo "Content of $PXE_CONF:"
  cat "$PXE_CONF"
else
  echo "Status: DISABLED (no PXE options - devices get DHCP only and boot from local storage)"
fi
# Also show toggle script status if present
if [ -x /opt/cm4-provisioning/toggle-network-boot-dhcp.sh ]; then
  echo ""
  echo "Toggle script output: $(/opt/cm4-provisioning/toggle-network-boot-dhcp.sh status 2>/dev/null)"
fi
REMOTE
@@ -1,6 +1,6 @@
#!/usr/bin/env bash
# Ensure TFTP config.txt on the LXC has kernel=kernel8.img and initramfs initrd.img followkernel
# so the bootloader loads the kernel and initrd (otherwise boot stops after start4.elf).
# Ensure TFTP config.txt on the LXC has kernel=kernel8.img, initramfs initrd.img followkernel,
# and uart_2ndstage=1 (GPU firmware logs to UART for netboot debugging).
# Run on LXC: bash ensure-tftpboot-config-kernel-initrd.sh
# Or: ssh root@10.20.30.153 'bash -s' < emmc-provisioning/scripts/ensure-tftpboot-config-kernel-initrd.sh

@@ -14,6 +14,12 @@ if [[ ! -f "$CONFIG" ]]; then
fi

CHANGED=0
# enable_uart=1 must be present (and within first 4KB of config) so netboot firmware sets 8250.nr_uarts=1; else kernel has no serial console (Pi firmware #1575).
if ! grep -qE 'enable_uart=1' "$CONFIG" 2>/dev/null; then
  echo "Adding enable_uart=1 to $CONFIG (required for kernel serial on netboot)"
  echo "enable_uart=1" >> "$CONFIG"
  CHANGED=1
fi
if ! grep -qE '^kernel=kernel8\.img' "$CONFIG" 2>/dev/null; then
  echo "Adding kernel=kernel8.img to $CONFIG"
  echo "kernel=kernel8.img" >> "$CONFIG"
@@ -26,20 +32,34 @@ if ! grep -qE 'initramfs initrd\.img' "$CONFIG" 2>/dev/null; then
  echo "initramfs initrd.img followkernel" >> "$CONFIG"
  CHANGED=1
fi
if ! grep -qE 'uart_2ndstage=1' "$CONFIG" 2>/dev/null; then
  echo "Adding uart_2ndstage=1 to $CONFIG (GPU firmware logs to UART for netboot debug)"
  echo "" >> "$CONFIG"
  echo "# GPU firmware logs to UART (see MESS: lines after PCI0 reset)" >> "$CONFIG"
  echo "uart_2ndstage=1" >> "$CONFIG"
  CHANGED=1
fi

if [[ "$CHANGED" -eq 1 ]]; then
  echo "Config updated. Ensure $TFTP_ROOT has kernel8.img and initrd.img."
else
  echo "Config already has kernel and initramfs lines."
  echo "Config already has kernel, initramfs and uart_2ndstage lines."
fi
grep -E 'kernel|initramfs' "$CONFIG" 2>/dev/null || true
grep -E 'enable_uart|kernel|initramfs|uart_2ndstage' "$CONFIG" 2>/dev/null || true

# Ensure serial-prefix dir gets a real copy of config (some TFTP servers don't follow symlinks)
# Ensure serial-prefix dirs get a real copy of config and symlinks to DTB files.
# GPU loads kernel/initrd/dtb from the serial prefix; missing DTBs cause "Failed to load Device Tree file '?'" and the kernel can hang.
for serial_dir in "$TFTP_ROOT"/[0-9a-f]*/; do
  [[ -d "$serial_dir" ]] || continue
  if [[ -L "$serial_dir/config.txt" ]] || [[ ! -f "$serial_dir/config.txt" ]]; then
    rm -f "$serial_dir/config.txt"
    cp "$CONFIG" "$serial_dir/config.txt"
    echo "Copied config.txt into $(basename "$serial_dir")/ (real file) so device gets full config."
  fi
  rm -f "$serial_dir/config.txt"
  cp "$CONFIG" "$serial_dir/config.txt"
  echo "Copied config.txt into $(basename "$serial_dir")/ (real file) so device gets full config."
  for dtb in "$TFTP_ROOT"/*.dtb; do
    [[ -f "$dtb" ]] || continue
    base=$(basename "$dtb")
    if [[ ! -e "$serial_dir/$base" ]]; then
      ln -sf "../$base" "$serial_dir/$base"
      echo "Linked $base into $(basename "$serial_dir")/"
    fi
  done
done
