[Haller]

Troubleshooting

When something breaks — symptoms, likely causes, and fixes — across install, hardware, HMI, teleop, networking, and the data/VLA path.

Organized by what you see, not by which subsystem owns it. If you have an error message, search this page for it first.

The subsystem pages also each carry their own inline tips for context-specific issues — those are still the deeper reference; this page is the cross-cutting index.

Install and first-boot

Either the rules file didn't install or the devices aren't plugged in. After re-running sudo ./scripts/install.sh:

ls /etc/udev/rules.d/99-haller-devices.rules    # should exist
ls -l /dev/haller_can /dev/haller_lidar          # should be symlinks

If the rules file is present but the symlinks are missing, the device is unplugged or its USB IDs don't match (lsusb to find them, then edit scripts/99-haller-devices.rules).

ImportError: undefined symbol: ncclDevCommDestroy when importing torch

The lerobot conda env's activate hooks aren't in place — PYTHONNOUSERSITE=1 must be set so the env's own NCCL wheel isn't shadowed by an outdated ~/.local/... copy. See LeRobot environment, step 3.

pip check reports unrelated conflicts about kubernetes / google-genai / markitdown

These come from ~/.local/ or /usr/lib/python3.12/dist-packages/, not the env. Harmless — pip check walks every Python install on the box.

torch.cuda.is_available() returns False

Run nvidia-smi. If that fails, the NVIDIA driver isn't installed at the system level; the conda env brings its own CUDA runtime but expects a working host driver.

Forgetting to conda activate lerobot

A naked pip install from a non-activated shell sometimes silently satisfies deps from ~/.local/. Always activate first.

Hardware not detected

Motor 'gripper' (model 'sts3215') was not found

The servo's V+ pin has no power. The board's logic is alive (USB-powered) but the servo bus isn't. Confirm the barrel-jack supply is connected and on, and that your supply voltage matches the servo variant (7.4 V supply for 7.4 V STS3215, 12 V for the 12 V variant). USB alone never powers the servos. Full hardware checklist: SO-101 arms.

[RxPacketError] Input voltage error!

The servo sees voltage outside its safe range and is alarming. You're feeding the wrong supply for this servo's voltage variant. Power off, swap supply (e.g. drop from 12 V to 7.4 V for the 7.4 V variant), retry.

lerobot-setup-motors writes the same ID to multiple motors

You had more than one motor on the bus during ID configuration. The script writes ID n to every motor it can see — chain all of them up and you'll overwrite earlier ones. Connect motors one at a time, gripper first, in reverse-id order. See SO-101 arms, step 2.

CAN bus won't come up

sudo ip link set can0 type can bitrate 1000000
sudo ip link set up can0
# RTNETLINK answers: Device or resource busy

can0 is already up — sudo ip link set down can0 first, or just confirm with ip -d link show can0. If can0 doesn't exist at all, the CANable2 adapter isn't enumerated as SLCAN — dmesg | tail will show what driver claimed it.

LiDAR enumerates but /scan is empty

The RPLIDAR A1's motor needs USB-bus power; some hubs can't supply enough. Plug the LiDAR directly into the Jetson, or use a powered hub. ros2 topic echo /scan --once will hang silently if the motor isn't spinning.

The udev rule's idVendor / idProduct / serial triple doesn't match. Find the actual identity:

udevadm info -a -n /dev/ttyACM0 | grep -E '{idVendor}|{idProduct}|{serial}' | head -3

Then update scripts/99-haller-devices.rules and re-run sudo ./scripts/install.sh.

HMI won't start

systemctl status haller-hmi.service says "exited code=1" immediately

Tail the journal: journalctl -u haller-hmi.service -e. Two most common causes:

  • ~/venvs/haller-hmi/ doesn't exist — run LeRobot environment install.
  • hmi/frontend/.next/standalone/server.js doesn't exist — (cd hmi/frontend && pnpm install && pnpm build).

Backend startup fails with No calibration for arm ... (calibration_id=...)

The arm has no calibration file in either robots/so_follower/ or teleoperators/*/. The log includes the exact lerobot-calibrate command to run. Run it (or use the in-browser wizard), then restart haller-hmi.

Backend crashes at startup with Could not open port

The USB symlink doesn't exist. ls -l /dev/haller_arm_follower — if missing, the udev rules aren't installed or the board isn't plugged in. See "A device is enumerated but the symlink is missing" above.

Port already in use (:8000 or :3000)

sudo ss -tlnp 'sport = :8000'
sudo ss -tlnp 'sport = :3000'

Usually a previous run_hmi.sh that didn't clean up. Kill the pid; systemctl restart haller-hmi.service.

HMI is up but broken

Frontend shows "disconnected" in the live badge

Backend isn't reachable on NEXT_PUBLIC_BACKEND_URL. Confirm curl http://localhost:8000/health returns {"status":"ok"}. For remote-backend setups: NEXT_PUBLIC_BACKEND_URL=http://orin.local:8000 pnpm dev.

POST /arm/{id}/goal returns 409

The arm is in auto mode. Click manual first; the joint sliders, Home, and preset replay are enabled only in manual.

POST /arm/{id}/goal returns 404 "unknown arm id"

Either the arm id in the URL is wrong, or the arm is disabled in hmi/backend/config.yaml.

Sliders are disabled and don't move

Check the mode badge — only manual enables them. They're also disabled for any arm currently participating in teleop (leader/follower or human).

Joint sliders show 0° for everything

Telemetry is connected but the backend's state_snapshot is failing — most likely the calibration file doesn't match the configured calibration_id. Check the backend log for the arm-telemetry warning.

/health returns 503

ROS bringup hasn't finished yet, or it failed. systemctl status haller-robot.service will tell you. Topics need to be present for the HMI base panel to function.

Calibration banner stuck even after a successful calibration

The HMI caches the calibration/status response on the dashboard. Either reload the page, or send POST /calibration/{arm_id}/abort — the next status poll picks up the on-disk change.

Calibration wizard

"Calibration was aborted (bus error or arm disconnected). Close and retry."

A Feetech bus error during the sweep auto-aborts the session and disables the action buttons. Power-cycle the servo supply, confirm the arm still responds (lerobot-find-port), then start a fresh session.

Mode-change rejected with 409 while wizard is open

The arm is locked while a calibration session is active. Finish or abort the session, then change modes.

"A calibration session is already in progress."

Only one session at a time across the HMI. Either resume the existing one (the wizard re-attaches on page reload) or abort it via POST /calibration/{arm_id}/abort.

POST /calibration/{arm_id}/finish_sweep returns 422

A joint never moved during the sweep. The response lists the offending joints — wiggle them and try again.

Teleop

Leader↔follower position is offset (especially shoulder_lift)

The two arms were calibrated in different physical neutral poses. Re-run the calibration wizard on one arm while holding it in the same physical neutral pose as the other. Background: SO-101 arms.

POST /teleop/start returns 400 "leader and follower must be different arms"

Pick different arms in the two dropdowns (or click to swap).

POST /teleop/start (or /teleop/human/start) returns 409

The other teleop mode is already running — they're mutually exclusive. Stop the current session first.

Human teleop: one arm moves, the other doesn't

That side's hand tracking is lost (out of frame, occluded, or pose-quality below threshold). The HUD shows tracking lost (left|right). Re-enter the frame.

Human teleop: both arms freeze suddenly mid-drive

You released SPACE. By design — release SPACE → both arms freeze within ~16 ms. Press SPACE again to resume.

Human teleop: session drops to IDLE on its own

The keypoint WebSocket disconnected for longer than the 5 s grace window — usually a tab close, a network blip, or the browser sleeping a backgrounded tab. Restart the session.

Networking and AP

HallerRobot SSID doesn't appear

The WIFI_DEV in scripts/setup_ap.sh is hard-coded to wlP1p1s0. On other hardware run nmcli device | grep wifi and edit that constant. Restart with sudo systemctl restart haller-ap.service. Also check rfkill list — the radio may be soft-blocked.

Connected to HallerRobot but laptop gets no DHCP lease

NetworkManager's ipv4.method shared failed to spawn a DHCP server. journalctl -u NetworkManager shows the error. Usually transient — restart haller-ap.service.

Connected but 10.42.0.1:3000 doesn't respond

Confirm the AP profile is active: nmcli con show --active should list HallerAP. Confirm the HMI is bound to 0.0.0.0 (not 127.0.0.1): ss -tlnp | grep 3000 on the Jetson.

Need internet on the Jetson while debugging

The Orin Nano dev kit has one radio. AP mode and client Wi-Fi are mutually exclusive. sudo systemctl stop haller-ap.service (or setup_ap.sh stop to also clear the connection profile) and the Jetson rejoins your last Wi-Fi network.

Dataset collection and cameras

Camera 503 on /cameras/{id}/snapshot

Another process is holding the device. Common offenders: cheese, a previous scripts/run_hmi.sh that didn't shut down cleanly, the legacy web_teleop.py. Find it with fuser /dev/video2 (or matching device path).

scripts/record_dataset.sh refuses to start ("device busy")

The HMI is holding the arm bus or a camera. Stop it first: Ctrl-C the dev script, or sudo systemctl stop haller-hmi.service. The recording script prints the holding PID.

Recorded dataset has flat joint traces

The leader wasn't being driven (you started recording without moving it), or the follower was stuck in STOP mode. Use lerobot-visualize-dataset --episode-index=0 to verify the joint state/action arrays before training.

lerobot-record errors with an unknown observation key

Your dataset's camera key names don't match what the policy expects. For replay-eval, use the --rename flag — see RunPod inference / Observation-key mismatches.

VLA and RunPod

torch.cuda.OutOfMemoryError during finetune

Try in order: drop batch size (BATCH_SIZE=2 scripts/runpod/finetune_pi05_lora.sh ...); confirm --policy.gradient_checkpointing=true is set; move to a larger GPU (A100 40 GB).

π0.5 weights take forever to download

First-run download is ~9 GB. Cache them on a RunPod Network Volume mounted at /workspace:

export HF_HOME=/workspace/hf_cache

so they only download once across pod sessions.

hf auth login doesn't persist between pod sessions

Token isn't being baked into the pod. Either set it in the RunPod "Environment variables" UI as HF_TOKEN=..., or mount the network volume above and put the credentials there.

Replay-eval predicts wildly wrong actions on every frame

Either (a) the observation keys are misnamed and the policy isn't seeing your images (check the error message for the expected key — see RunPod inference / Observation-key mismatches), or (b) the policy genuinely doesn't understand your task without finetuning. (b) is expected for π0.5 out of the box.

ROS stack

ros2 topic list shows nothing under haller-robot.service

The launch is up but topics aren't appearing across processes. Confirm ROS_DOMAIN_ID matches in every shell you're using (both the service and your debug shell). The HMI service runs with ROS_DOMAIN_ID=0; set the same in your shell.

RViz from a remote machine can't see the topics

Same — ROS_DOMAIN_ID mismatch between Jetson and laptop. Also check the firewall on whichever side runs RViz (DDS uses UDP multicast on a wide port range; some corporate networks block it).

use_sim:=true and the robot still moves

The simulation flag toggles to ros2_control with fake actuators — but the launch file still starts the motor controller node unless the flag is plumbed through. Use enable_vision:=false use_sim:=true for laptop-only dev, and double-check no real motors are powered.

Where to look when nothing else helps

LayerFirst place to look
HMI servicejournalctl -u haller-hmi.service -e -f
ROS servicejournalctl -u haller-robot.service -e -f
AP servicejournalctl -u haller-ap.service -e -f and nmcli con show --active
HMI runtimeBrowser DevTools → Network tab (REST failures) and Console (frontend errors)
Arm buslerobot-find-port, then a python scripts/test_so101_arm.py --port ... --id ... smoke test
Camerascurl -o /tmp/x.jpg http://localhost:8000/cameras/{id}/snapshot
CAN buscandump can0 in one terminal while pinging a motor in another
ROS topicsros2 topic list, ros2 topic echo /<name> --once, ros2 doctor

If the issue still isn't obvious, file it at github.com/oscardvs/haller_ws/issues with the relevant journalctl excerpt, the symptom, and what you tried.

On this page

Install and first-bootscripts/install.sh doesn't create /dev/haller_* symlinksImportError: undefined symbol: ncclDevCommDestroy when importing torchpip check reports unrelated conflicts about kubernetes / google-genai / markitdowntorch.cuda.is_available() returns FalseForgetting to conda activate lerobotHardware not detectedMotor 'gripper' (model 'sts3215') was not found[RxPacketError] Input voltage error!lerobot-setup-motors writes the same ID to multiple motorsCAN bus won't come upLiDAR enumerates but /scan is emptyA device is enumerated but the symlink is missingHMI won't startsystemctl status haller-hmi.service says "exited code=1" immediatelyBackend startup fails with No calibration for arm ... (calibration_id=...)Backend crashes at startup with Could not open portPort already in use (:8000 or :3000)HMI is up but brokenFrontend shows "disconnected" in the live badgePOST /arm/{id}/goal returns 409POST /arm/{id}/goal returns 404 "unknown arm id"Sliders are disabled and don't moveJoint sliders show 0° for everything/health returns 503Calibration banner stuck even after a successful calibrationCalibration wizard"Calibration was aborted (bus error or arm disconnected). Close and retry."Mode-change rejected with 409 while wizard is open"A calibration session is already in progress."POST /calibration/{arm_id}/finish_sweep returns 422TeleopLeader↔follower position is offset (especially shoulder_lift)POST /teleop/start returns 400 "leader and follower must be different arms"POST /teleop/start (or /teleop/human/start) returns 409Human teleop: one arm moves, the other doesn'tHuman teleop: both arms freeze suddenly mid-driveHuman teleop: session drops to IDLE on its ownNetworking and APHallerRobot SSID doesn't appearConnected to HallerRobot but laptop gets no DHCP leaseConnected but 10.42.0.1:3000 doesn't respondNeed internet on the Jetson while debuggingDataset collection and camerasCamera 503 on /cameras/{id}/snapshotscripts/record_dataset.sh refuses to start ("device busy")Recorded dataset has flat joint traceslerobot-record errors with an unknown observation keyVLA and RunPodtorch.cuda.OutOfMemoryError during finetuneπ0.5 weights take forever to downloadhf auth login doesn't persist between pod sessionsReplay-eval predicts wildly wrong actions on every frameROS stackros2 topic list shows nothing under haller-robot.serviceRViz from a remote machine can't see the topicsuse_sim:=true and the robot still movesWhere to look when nothing else helps