diff --git a/emmc-provisioning/docs/NETWORK-BOOT-TROUBLESHOOTING.md b/emmc-provisioning/docs/NETWORK-BOOT-TROUBLESHOOTING.md
index 49e4373..bd93888 100644
--- a/emmc-provisioning/docs/NETWORK-BOOT-TROUBLESHOOTING.md
+++ b/emmc-provisioning/docs/NETWORK-BOOT-TROUBLESHOOTING.md
@@ -129,15 +129,27 @@ and you **never** see a line about network (e.g. "Trying DHCP", "TFTP", or "Boot
 
 ## Boot stops after start4.elf ("PCI0 reset" then nothing)
 
-If the serial log shows **TFTP** for config.txt, start4.elf, fixup4.dat, then **"Starting start4.elf"**, **"Stopping network"**, **"PCI0 reset"**, and **no** TFTP requests for **kernel8.img** or **initrd.img**, the bootloader is not loading the kernel. That usually means **config.txt** in the TFTP root does not have the **kernel** and **initramfs** lines.
+### What’s actually going on
 
-**Fix on the LXC:** ensure `/srv/tftpboot/config.txt` contains (and that `0d1ddbda/config.txt` is a symlink to it or has the same content):
+The **EEPROM bootloader** only does TFTP for config.txt, start4.elf, and fixup4.dat. It then **starts the GPU firmware (start4.elf)** and **stops the network**. The **kernel and initrd are loaded by the GPU firmware**, not by the EEPROM: after “Starting start4.elf”, the GPU is supposed to bring the network back up and TFTP kernel8.img, cmdline.txt, and initrd.img. If you never see TFTP for kernel8.img/initrd.img and the log stops at “PCI0 reset”, the GPU stage is not doing that. Common causes:
+
+1. **Config not seen by the GPU** — The config the EEPROM fetched (e.g. from `0d1ddbda/config.txt`) must contain `kernel=kernel8.img` and `initramfs initrd.img followkernel`. If that file was a symlink or truncated, the GPU may not see those lines. Use a **real copy** of the full config in the serial dir (see ensure script below).
+2. **No visibility into the GPU** — The EEPROM logs stop at “PCI0 reset”; the next step is inside the GPU firmware. To see GPU messages (e.g. network bring-up, TFTP, or errors), add **`uart_2ndstage=1`** to config.txt so the GPU logs to the UART. Then power-cycle and watch for lines like `MESS:... genet: LINK STATUS` or TFTP activity.
+3. **Firmware/board quirk** — On some boards or firmware versions the GPU netboot path can fail silently. Ensuring the latest Pi 4/CM4 boot files in the TFTP root and trying **start4cd.elf** + **fixup4cd.dat** (or leaving defaults) is worth a try.
+
+If the serial log shows **TFTP** for config.txt, start4.elf, fixup4.dat, then **"Starting start4.elf"**, **"Stopping network"**, **"PCI0 reset"**, and **no** TFTP requests for **kernel8.img** or **initrd.img**, use the checks below.
+
+**Fix on the LXC:** ensure `/srv/tftpboot/config.txt` contains (and that `0d1ddbda/config.txt` is a real copy with the same content):
 
 ```ini
+enable_uart=1
 kernel=kernel8.img
 initramfs initrd.img followkernel
+uart_2ndstage=1
 ```
 
+`enable_uart=1` is required for the kernel serial console when netbooting (otherwise the firmware can set 8250.nr_uarts=0). `uart_2ndstage=1` makes the GPU firmware log to the UART so you see **MESS:** lines after "PCI0 reset" (e.g. network bring-up, TFTP, or errors).
+
 You can run:
 
 ```bash
@@ -147,6 +159,30 @@ ssh root@ 'bash -s' < emmc-provisioning/scripts/ensure-tftpboot-config-k
 Also ensure the TFTP root has **kernel8.img** and **initrd.img** (and the serial subdir has symlinks or copies). Then power-cycle the device; you should see TFTP_GET for kernel8.img and initrd.img, then the kernel and initramfs (e.g. rescue shell or provisioning client) run.
 
 
+**If it still stops after “PCI0 reset”:**
+
+- Add **`uart_2ndstage=1`** to the TFTP config.txt (root and serial copy). Re-run the ensure script so the serial dir gets the updated config, then power-cycle. Watch the serial log for **MESS:** lines from the GPU (e.g. `genet: LINK STATUS`, TFTP, or errors). That shows whether the GPU is bringing the network up and trying to load the kernel.
+- On the LXC, confirm the config the device gets has the right size and content:
+  `ssh root@ 'wc -c /srv/tftpboot/0d1ddbda/config.txt && grep -E "kernel|initramfs|uart_2ndstage" /srv/tftpboot/0d1ddbda/config.txt'`
+
+---
+
+## Kernel loads but serial stops at "Baud rate change done" (no rescue shell)
+
+If you see the GPU load kernel8.img and initrd.img, then **"Baud rate change done..."** and nothing else (no rescue shell, no kernel messages), the kernel is likely hanging very early because of a **missing or invalid Device Tree**. The GPU log may show **`dterror: Failed to load Device Tree file '?'`**.
+
+The GPU loads files from the **serial-prefix** dir (e.g. `0d1ddbda/`). If the **.dtb** files (e.g. `bcm2711-rpi-cm4.dtb`, `bcm2711-rpi-cm4-io.dtb`) are only in the TFTP root and not in that dir, the firmware can fail to load the right DTB and the kernel gets no valid device tree.
+
+**Fix:** Ensure the TFTP root has the Pi 4/CM4 DTB files (from the [Raspberry Pi firmware](https://github.com/raspberrypi/firmware) `boot/` folder) and that each **serial-prefix** dir has symlinks to them. Re-run the ensure script (it now links `*.dtb` into each serial dir):
+
+```bash
+ssh root@ 'bash -s' < emmc-provisioning/scripts/ensure-tftpboot-config-kernel-initrd.sh
+```
+
+If the TFTP root has no `*.dtb` files, populate it from the Pi firmware (e.g. run `populate-tftpboot-from-git.sh` or copy `bcm2711-rpi-cm4.dtb`, `bcm2711-rpi-cm4-io.dtb`, and other `bcm2711*.dtb` from the firmware repo into `/srv/tftpboot`), then run the ensure script again and power-cycle the device.
+
+**Serial stops at "Baud rate change done" (no kernel/initramfs output):** On Pi 4/CM4 netboot, the firmware can force **8250.nr_uarts=0**, which disables the kernel serial driver so you get no console after the GPU handoff ([raspberrypi/firmware#1575](https://github.com/raspberrypi/firmware/issues/1575)). The workaround is **`enable_uart=1`** in config.txt (within the first 4KB). The ensure script adds it; re-run the script so the root and serial-prefix configs have it, then power-cycle. Keep serial at **115200** baud.
+
 ---
 
 ## TFTP "file .../SERIAL/start4.elf not found" — serial-number prefix
diff --git a/emmc-provisioning/network-boot-initramfs/build.sh b/emmc-provisioning/network-boot-initramfs/build.sh
index 127b8f4..430715e 100755
--- a/emmc-provisioning/network-boot-initramfs/build.sh
+++ b/emmc-provisioning/network-boot-initramfs/build.sh
@@ -144,13 +144,18 @@ if [ ! -f "$BUILD_DIR/bin/busybox" ] || [ ! -s "$BUILD_DIR/bin/busybox" ]; then
 fi
 chmod +x "$BUILD_DIR/bin/busybox" 2>/dev/null || true
 
-# Busybox applets we need (sh, mount, udhcpc, etc.)
+# Busybox applet symlinks (mount, mkdir, etc.). When building arm64 on x86, busybox cannot be run so --list fails; create symlinks manually.
+APPLETS="sh ash mount umount mkdir cat ip udhcpc sleep echo grep cut awk hostname dd reboot chroot ls rm"
 cd "$BUILD_DIR/bin"
-./busybox --list 2>/dev/null | while read applet; do
-  case "$applet" in
-    sh|ash|mount|umount|mkdir|cat|ip|udhcpc|sleep|echo|grep|cut|awk|hostname|dd|reboot) ln -sf busybox "$applet"; ;;
-  esac
-done
+if ./busybox --list >/dev/null 2>&1; then
+  ./busybox --list | while read applet; do
+    case " $APPLETS " in *" $applet "*) ln -sf busybox "$applet"; ;; esac
+  done
+else
+  for applet in $APPLETS; do
+    [ -e "$applet" ] || ln -sf busybox "$applet"
+  done
+fi
 [ -e sh ] || ln -sf busybox sh
 
 # Build cpio (gzip)
diff --git a/emmc-provisioning/network-boot-initramfs/init b/emmc-provisioning/network-boot-initramfs/init
index 7c815c8..e0660bb 100644
--- a/emmc-provisioning/network-boot-initramfs/init
+++ b/emmc-provisioning/network-boot-initramfs/init
@@ -15,10 +15,11 @@ mount -t devtmpfs none /dev
 mkdir -p /dev/pts
 mount -t devpts none /dev/pts
 
-# Kernel might have brought up eth0 via ip=dhcp; ensure we have an IP
+# Kernel might have brought up eth0 via ip=dhcp; ensure we have an IP (run in background with timeout so we don't block rescue shell)
 if ! ip addr show | grep -q 'inet .* scope global'; then
   echo "Getting DHCP lease..."
-  udhcpc -f -q -i eth0 -n 2>/dev/null || true
+  ( udhcpc -f -q -i eth0 -n -T 5 2>/dev/null || true ) &
+  sleep 6
 fi
 
 # Allow kernel cmdline to override: provisioning_server=... and rescue mode
@@ -35,7 +36,8 @@ export PROVISIONING_SERVER
 if [ "$RESCUE" -eq 1 ]; then
   echo "=== RESCUE MODE (provisioning_rescue=1) ==="
   echo "Run /rescue-eeprom.sh to mount eMMC and change boot order (rpi-eeprom-config), then reboot."
-  echo "Or run /bin/sh for a shell."
+  # Ensure shell I/O goes to serial console (some setups drop output otherwise)
+  [ -c /dev/console ] && exec </dev/console >/dev/console 2>&1
   exec /bin/sh -i
 fi
 
diff --git a/emmc-provisioning/network-boot-initramfs/initrd.img b/emmc-provisioning/network-boot-initramfs/initrd.img
index c3ca5b6..a764ce8 100644
Binary files a/emmc-provisioning/network-boot-initramfs/initrd.img and b/emmc-provisioning/network-boot-initramfs/initrd.img differ
diff --git a/emmc-provisioning/scripts/check-dhcp-network-boot-on-lxc.sh b/emmc-provisioning/scripts/check-dhcp-network-boot-on-lxc.sh
new file mode 100755
index 0000000..1dbad72
--- /dev/null
+++ b/emmc-provisioning/scripts/check-dhcp-network-boot-on-lxc.sh
@@ -0,0 +1,24 @@
+#!/usr/bin/env bash
+# Check whether DHCP network-boot options (66/67) are enabled on the LXC.
+# Usage: ./check-dhcp-network-boot-on-lxc.sh [LXC_HOST]
+# Example: ./check-dhcp-network-boot-on-lxc.sh root@10.20.30.153
+
+LXC="${1:-root@10.20.30.153}"
+PXE_CONF="/etc/dnsmasq.d/network-boot-pxe.conf"
+
+echo "Checking DHCP network-boot status on $LXC ..."
+ssh "$LXC" "bash -s" << 'REMOTE'
+PXE_CONF="/etc/dnsmasq.d/network-boot-pxe.conf"
+if [ -f "$PXE_CONF" ]; then
+  echo "Status: ENABLED (option 66/67 are advertised - devices will try network boot)"
+  echo "Content of $PXE_CONF:"
+  cat "$PXE_CONF"
+else
+  echo "Status: DISABLED (no PXE options - devices get DHCP only and boot from local storage)"
+fi
+# Also show toggle script status if present
+if [ -x /opt/cm4-provisioning/toggle-network-boot-dhcp.sh ]; then
+  echo ""
+  echo "Toggle script output: $(/opt/cm4-provisioning/toggle-network-boot-dhcp.sh status 2>/dev/null)"
+fi
+REMOTE
diff --git a/emmc-provisioning/scripts/ensure-tftpboot-config-kernel-initrd.sh b/emmc-provisioning/scripts/ensure-tftpboot-config-kernel-initrd.sh
index 468af84..8850c16 100755
--- a/emmc-provisioning/scripts/ensure-tftpboot-config-kernel-initrd.sh
+++ b/emmc-provisioning/scripts/ensure-tftpboot-config-kernel-initrd.sh
@@ -1,6 +1,6 @@
 #!/usr/bin/env bash
-# Ensure TFTP config.txt on the LXC has kernel=kernel8.img and initramfs initrd.img followkernel
-# so the bootloader loads the kernel and initrd (otherwise boot stops after start4.elf).
+# Ensure TFTP config.txt on the LXC has kernel=kernel8.img, initramfs initrd.img followkernel,
+# and uart_2ndstage=1 (GPU firmware logs to UART for netboot debugging).
 # Run on LXC: bash ensure-tftpboot-config-kernel-initrd.sh
 # Or: ssh root@10.20.30.153 'bash -s' < emmc-provisioning/scripts/ensure-tftpboot-config-kernel-initrd.sh
 
@@ -14,6 +14,12 @@ if [[ ! -f "$CONFIG" ]]; then
 fi
 
 CHANGED=0
+# enable_uart=1 must be present (and within first 4KB of config) so netboot firmware sets 8250.nr_uarts=1; else kernel has no serial console (Pi firmware #1575).
+if ! grep -qE 'enable_uart=1' "$CONFIG" 2>/dev/null; then
+  echo "Adding enable_uart=1 to $CONFIG (required for kernel serial on netboot)"
+  echo "enable_uart=1" >> "$CONFIG"
+  CHANGED=1
+fi
 if ! grep -qE '^kernel=kernel8\.img' "$CONFIG" 2>/dev/null; then
   echo "Adding kernel=kernel8.img to $CONFIG"
   echo "kernel=kernel8.img" >> "$CONFIG"
@@ -26,20 +32,34 @@ if ! grep -qE 'initramfs initrd\.img' "$CONFIG" 2>/dev/null; then
   echo "initramfs initrd.img followkernel" >> "$CONFIG"
   CHANGED=1
 fi
+if ! grep -qE 'uart_2ndstage=1' "$CONFIG" 2>/dev/null; then
+  echo "Adding uart_2ndstage=1 to $CONFIG (GPU firmware logs to UART for netboot debug)"
+  echo "" >> "$CONFIG"
+  echo "# GPU firmware logs to UART (see MESS: lines after PCI0 reset)" >> "$CONFIG"
+  echo "uart_2ndstage=1" >> "$CONFIG"
+  CHANGED=1
+fi
 
 if [[ "$CHANGED" -eq 1 ]]; then
   echo "Config updated. Ensure $TFTP_ROOT has kernel8.img and initrd.img."
 else
-  echo "Config already has kernel and initramfs lines."
+  echo "Config already has kernel, initramfs and uart_2ndstage lines."
 fi
-grep -E 'kernel|initramfs' "$CONFIG" 2>/dev/null || true
+grep -E 'enable_uart|kernel|initramfs|uart_2ndstage' "$CONFIG" 2>/dev/null || true
 
-# Ensure serial-prefix dir gets a real copy of config (some TFTP servers don't follow symlinks)
+# Ensure serial-prefix dirs get a real copy of config and symlinks to DTB files.
+# GPU loads kernel/initrd/dtb from the serial prefix; missing DTBs cause "Failed to load Device Tree file '?'" and the kernel can hang.
 for serial_dir in "$TFTP_ROOT"/[0-9a-f]*/; do
   [[ -d "$serial_dir" ]] || continue
-  if [[ -L "$serial_dir/config.txt" ]] || [[ ! -f "$serial_dir/config.txt" ]]; then
-    rm -f "$serial_dir/config.txt"
-    cp "$CONFIG" "$serial_dir/config.txt"
-    echo "Copied config.txt into $(basename "$serial_dir")/ (real file) so device gets full config."
-  fi
+  rm -f "$serial_dir/config.txt"
+  cp "$CONFIG" "$serial_dir/config.txt"
+  echo "Copied config.txt into $(basename "$serial_dir")/ (real file) so device gets full config."
+  for dtb in "$TFTP_ROOT"/*.dtb; do
+    [[ -f "$dtb" ]] || continue
+    base=$(basename "$dtb")
+    if [[ ! -e "$serial_dir/$base" ]]; then
+      ln -sf "../$base" "$serial_dir/$base"
+      echo "Linked $base into $(basename "$serial_dir")/"
+    fi
+  done
 done
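
Review note: the ensure script in this diff rewrites config.txt in every serial-prefix directory, so a read-only check of the result may be useful before power-cycling a device. The sketch below is a hypothetical helper and not part of this change. The function name `check_serial_dir`, the `REQUIRED_LINES` array, and the temporary demo directory are assumptions made for illustration; the four settings are the ones the ensure script appends. It also assumes each setting sits on its own line with no trailing whitespace, which may not hold for hand-edited configs.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this change): report which required
# config.txt settings are missing from one serial-prefix directory.
set -u

# The four settings the ensure script guarantees in config.txt.
REQUIRED_LINES=(
  "enable_uart=1"
  "kernel=kernel8.img"
  "initramfs initrd.img followkernel"
  "uart_2ndstage=1"
)

# check_serial_dir DIR
# Prints one SYMLINK/MISSING line per problem; returns 0 only if none found.
check_serial_dir() {
  local dir="${1%/}" cfg status=0 line
  cfg="$dir/config.txt"
  if [[ -L "$cfg" ]]; then
    echo "SYMLINK: $cfg should be a real copy"
    status=1
  fi
  if [[ ! -f "$cfg" ]]; then
    echo "MISSING: $cfg"
    return 1
  fi
  for line in "${REQUIRED_LINES[@]}"; do
    # -x matches whole lines and -F treats the setting as a fixed string,
    # so a commented-out setting such as "#uart_2ndstage=1" does not count.
    if ! grep -qxF -- "$line" "$cfg"; then
      echo "MISSING: '$line' in $cfg"
      status=1
    fi
  done
  return "$status"
}

# Demo on a throwaway directory so the sketch runs anywhere.
demo_root="$(mktemp -d)"
mkdir -p "$demo_root/0d1ddbda"
printf '%s\n' "enable_uart=1" "kernel=kernel8.img" > "$demo_root/0d1ddbda/config.txt"
check_serial_dir "$demo_root/0d1ddbda" || echo "serial dir needs attention"
rm -rf "$demo_root"
```

On the LXC the same function could be pointed at a real directory such as `/srv/tftpboot/0d1ddbda`. Whole-line matching is a deliberate choice here: the unanchored `grep -qE 'uart_2ndstage=1'` style used in the ensure script would also accept a commented-out line.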