Troubleshooting
===============

.. image:: _static/things_can_go_right.png
   :alt: Illustration of MagAO-X with motto: "Having more things just means more things can go right"
   :align: center
   :scale: 33%

Figuring out what exactly isn't working
---------------------------------------

To narrow down the failing component, use ``xctrl status`` to see if any MagAO-X apps are not running. The typical MagAO-X app is started by ``xctrl startup`` based on a line in a config file in ``/opt/MagAOX/config/proclist_$MAGAOX_ROLE.txt``. This proclist determines which application to start and which config file from ``/opt/MagAOX/config`` should be supplied as the ``-n`` option (see :doc:`Standard options `). It also uses ``sudo`` to run the process as user ``xsup``, regardless of which user called ``xctrl startup``.

Many, if not all, MagAO-X apps are intended to run “forever” (i.e., until shutdown). If the process is ``dead``, you can attach to the ``tmux`` session that’s the parent of the process in question with ``xctrl inspect PROCNAME`` (where ``PROCNAME`` is the name of the failed process). This will occasionally reveal error messages that did not get to the log.

For example, if ``trippLitePDU`` is started by ``xctrl startup`` with config specified by ``-n pdu0`` and there’s a syntax error in ``/opt/MagAOX/config/pdu0.conf`` preventing startup, you can attach to the tmux session with

.. code-block:: bash

   yourlogin$ xctrl inspect pdu0

The errors before exit, if any, will be in the log. The last few lines of the log can be checked with ``logdump -f pdu0``.

The command that started the app will be of the form ``/opt/MagAOX/bin/$appName -n $configName``. You can use the up-arrow key in the tmux session to retrieve it from the shell history and try to relaunch once you’ve corrected whatever error was preventing startup.

Addressing specific issues
--------------------------

.. _missing_lowfs:

Missing ``camflowfs`` or ``camllowfs``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When booting ICC, ensure that ``camflowfs`` and ``camllowfs`` are already powered on. Otherwise, you will have to reboot ICC to successfully connect.

Things to try:

Hotplugging
^^^^^^^^^^^

1. begin with both cameras powered off
2. on exao3/ICC: ``xctrl shutdown camflowfs camllowfs``
3. power on both cameras
4. as a non-xsup user, on exao3/ICC, run ``sudo /opt/pvcam/drivers/in-kernel/pcie/hotplug_pcie.sh``
5. verify the number of "active cameras" it reports, and that none say ``no name found, rescan or PC reboot needed``
6. on exao3/ICC: ``xctrl startup camflowfs camllowfs``

Outcome 1: successful hotplugging
"""""""""""""""""""""""""""""""""

The ``camflowfs.fsm`` / ``camllowfs.fsm`` property will change to ``OPERATING`` and images will start to appear.

Outcome 2: unsuccessful hotplugging
"""""""""""""""""""""""""""""""""""

When hotplugging is unsuccessful, you will see output like the below that includes ``no name found, rescan or PC reboot needed``.

::

   $ sudo /opt/pvcam/drivers/in-kernel/pcie/hotplug_pcie.sh
   Detected 1 supported card(s)
   0000:41:00.0 - upstream port (Dolphin PXH832 card [10b5:8733])
   0000:42:09.0 - downstream port
   No camera found
   0000:42:08.0 - downstream port
   No camera found
   Found 0 active camera(s)
   cat: /sys/module/pvcam_pcie/refcnt: No such file or directory
   Performing hot-reset of downstream ports...
   0000:42:08.0
   0000:42:09.0
   Removing active cameras...
   Rescanning downstream ports...
   0000:42:08.0
   0000:42:09.0
   Detected 1 supported card(s)
   0000:41:00.0 - upstream port (Dolphin PXH832 card [10b5:8733])
   0000:42:09.0 - downstream port
   0000:44:00.0 - camera rev. 8 (no name found, rescan or PC reboot needed)
   0000:42:08.0 - downstream port
   0000:43:00.0 - camera rev. 8 (no name found, rescan or PC reboot needed)
   Found 2 active camera(s)

*This section TODO*

Shared memory image problems with “No space left on device” errors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When starting MagAO-X apps or CACAO apps that use shared memory images, the ImageStreamIO library will try to create shared memory images on ``/milk/shm``. This can fail with an error like:

::

   ERROR [ FILE: /opt/MagAOX/source/cacao/src/ImageStreamIO/ImageStreamIO.c FUNCTION: ImageStreamIO_createIm_gpu LINE: 521 ]
   C Error: No space left on device

Indeed, if you use ``df -h``, you’ll see that ``/milk/shm`` is full:

.. code-block:: bash

   $ df -h
   Filesystem Size Used Avail Use% Mounted on
   [...]
   tmpfs 63G 63G 0 100% /milk/shm
   [...]

The solution is to :doc:`shut down ` and then clear ``/milk/shm``.

.. code-block:: bash

   you$ xsupify
   xsup$ cd /milk/shm
   xsup$ rm *

If rerunning ``df -h`` *still* doesn’t show any space available, something is probably holding a reference to the files. (See `this SuperUser question `__.) You should reboot the computer with ``sudo reboot`` (having already shut down / rested any hardware).

Loop failing to close for no apparent reason and/or intermittent failures of CACAO calibration process
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Believe it or not, this can be a sign of insufficient disk space. Consult ``df -h`` and see if any of the filesystems have ``Use%`` of 100%. This can also be checked in INDI with the sysMonitor process for the relevant computer (sysMonRTC, sysMonICC).

.. _missing_gpu:

Lockup / Missing GPUs / ``nvidia-smi`` errors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Our computers with PCIe expansion cards will occasionally lock up, or will lose a GPU (``GPU has fallen off the bus`` errors). Sometimes running ``nvidia-smi`` fails with ``Unable to determine the device handle for GPU 0000:8C:00.0: GPU is lost. Reboot the system to recover this GPU.`` GPU telemetry will also disappear from the monitoring dashboard.

1. If the system is responding:

   1. If you were using the system, rest any attached hardware and begin camera warmup. (You don't have to wait for them to reach the warmup temperature.) (For RTC: woofer, tweeter, ttmmod, ttmpupil, and camwfs.)
   2. Shut down (requires sudo)

      .. code-block:: bash

         [user@exaoN ~]$ sudo shutdown -h now

   3. Now "press the power button" using the Moxa IO unit (see the ICC or RTC Power-On section for that computer in the :doc:`System Power On ` procedure)

2. If the system is not responding, GPUs continue to fall off the bus, or ``nvidia-smi`` errors persist after following the procedure above:

   1. If you can, perform steps 1.1 and 1.2 above to bring the system down in an orderly fashion.
   2. Power down ``pdu0.comprtc`` or ``pdu.compicc`` (e.g. with **pwrGUI**)
   3. Wait at least 10 seconds.
   4. Now perform all of the ICC or RTC Power-On steps from the :doc:`System Power On ` procedure.

OCAM connectivity / bad data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

OCAM connects over two CameraLink connections. CameraLink #1 carries serial communication with the detector, so if you’re able to command the camera but your data appear bad in ``rtimv camwfs``, the culprit is likely the CameraLink #2 cable. Reseat it, run ``xctrl restart camwfs`` on ICC, and restart ``rtimv``.

Alpao DM not responding
~~~~~~~~~~~~~~~~~~~~~~~

Make sure it has been initialized. There is an ``initialize_alpao`` systemd unit that runs at boot and initializes the interface card. Successful execution looks like this in ``systemctl status initialize_alpao`` output:

.. code-block:: bash

   $ systemctl status initialize_alpao
   ● initialize_alpao.service - Initialize Alpao interface card
   Loaded: loaded (/opt/MagAOX/config/initialize_alpao.service; enabled; vendor preset: disabled)
   Active: active (exited) since Sun 2019-09-29 11:18:34 MST; 20min ago
   Process: 4449 ExecStart=/opt/MagAOX/config/initialize_alpao.sh (code=exited, status=0/SUCCESS)
   Main PID: 4449 (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/initialize_alpao.service
   Sep 29 11:18:34 exao3.as.arizona.edu systemd[1]: Started Initialize Alpao interface card.
   Sep 29 11:18:35 exao3.as.arizona.edu initialize_alpao.sh[4449]: ====================================================================
   Sep 29 11:18:35 exao3.as.arizona.edu initialize_alpao.sh[4449]: Ref.ID | Model | RSW1 | Type | Device No.
   Sep 29 11:18:35 exao3.as.arizona.edu initialize_alpao.sh[4449]: --------------------------------------------------------------------
   Sep 29 11:18:35 exao3.as.arizona.edu initialize_alpao.sh[4449]: 1 | PEX-292144 | 0 | DI | 17
   Sep 29 11:18:35 exao3.as.arizona.edu initialize_alpao.sh[4449]: --------------------------------------------------------------------
   Sep 29 11:18:35 exao3.as.arizona.edu initialize_alpao.sh[4449]: 2 | PEX-292144 | 0 | DO | 18
   Sep 29 11:18:35 exao3.as.arizona.edu initialize_alpao.sh[4449]: ====================================================================

The script is saved at ``/opt/MagAOX/config/initialize_alpao.sh``, if you want to see what it’s doing. Note that if the cards have already been initialized, executing it again will appear to fail with a message about not finding cards to initialize.

DM Latency and Communication Troubleshooting
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There are various ways that the shared memory interprocess communication between the deformable mirrors, loop control(s), and the hardware control processes can stop functioning properly.


Examples with known fixes:
^^^^^^^^^^^^^^^^^^^^^^^^^^

- Inability to set or zero flat or test from the dm control gui

  - This likely points to a bad semaphore. Simply release DM, then re-initialize, and it usually clears. If not, go to more general steps below.

- Excessive latency, occurs especially for ALPAOs

  - This usually requires a power cycle of the driver itself. Release the DM, then use the power control GUI to turn off, then on the DM driver.

- Skipped commands

  - This is possibly caused by collisions on a semaphore, meaning more than one process is monitoring a given semaphore. This can be diagnosed with ``streamCTRL``. If this is not the case, a full software shutdown (both cacao and magao-x) and clearing the ``/milk/shm`` and ``/dev/shm`` directories (``rm *``), then restarting, should clear the problem. See step 5 below.

General Troubleshooting
^^^^^^^^^^^^^^^^^^^^^^^

General troubleshooting steps, in order of severity (try the lower-numbered ones first if you don’t have a clear idea what the problem is):

1. release, then initialize from the ``dmCtrl`` GUI
2. release, then restart the DM controller software, e.g. for the woofer:

   .. code-block:: bash

      rtc$ xctrl restart dmwoofer

3. restart the CACAO process that combines the DM shmims:

   - first stop the DM controller (see above)
   - restart ``dmcomb`` (or testbed equivalent) using ``fpsCTRL``

     - run ``fpsCTRL``
     - select process to restart with arrow keys
     - hit lower-case ``r`` to stop the process
     - hit upper-case ``R`` to start it again

   - restart the DM controller (see above)

   Note: this may cause problems in some other processes due to shmim recreation.

4. Power cycle the DM

   - release from the ``dmCtrl`` GUI
   - turn off the power with the ``pwrCtrl`` GUI, then turn it back on
   - if it doesn’t happen automatically, initialize the DM from the GUI when it has power
   - if this does not fix the problem, try steps 1-3 again.

5. Full Software Restart

   - Place all hardware controlled from this computer in a safe condition

     - rest ``modttm`` and ``ttmpupil``
     - start camera warmup (in case you can’t get software back up)
     - release all DMs controlled from this computer

   - Shut down all software with:

     .. code-block:: bash

        rtc$ xctrl shutdown
        rtc$ tmux kill-server # for cacao processes not managed by xctrl

   - Clear all shared memory:

     .. code-block:: bash

        rtc$ cd /milk/shm
        rtc$ sudo rm *
        rtc$ cd /dev/shm
        rtc$ sudo rm *

   - Now restart software and restore hardware to operating condition

6. Reboot

   - This is a last resort. This may be necessary if, for instance, a problem has developed in the device driver.
   - Follow procedure for computer reboot. Ensure all hardware is in a safe condition, including powered-off if needed, before rebooting.

EDT Framegrabber Problems (camwfs)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The EDT PCIe framegrabber occasionally stops responding. The main symptom of this is no data from ``camwfs``, and no response on the serial over camera link.

If ``camwfs`` stops responding on serial (evident in logs, probably frame corruption), first shut down the controlling application.

.. code-block:: bash

   $ xctrl shutdown camwfs

You will next need to switch from user ``xsup`` to yourself:

.. code-block:: bash

   $ su

then do these steps to reload the EDT driver:

.. code-block:: bash

   $ cd /opt/EDTpdv
   $ sudo ./edt_unload
   $ sudo ./edt_load

This will reset the kernel module and restore operation. Now return to ``xsup`` and restart the controlling application:

.. code-block:: bash

   $ exit
   $ xctrl startup camwfs #<-change if a different camera

After this occurs, you will need to restart the CACAO loop processes so they reconnect to the camwfs shmim.

Camsci1/2 not responding
~~~~~~~~~~~~~~~~~~~~~~~~

If ``camsci1`` and/or ``camsci2`` stop responding, first attempt to restart the control software with ``xctrl restart``. If this does not restore operation, the PICam library needs to be reset.
Perform the following steps:

1. Turn power off for both cameras. Note that you will not be able to verify detector temperature, but this cannot be avoided.
2. Stop both ``camsci`` control processes. Either use xctrl or go to the tmux session and use ctrl-c.
3. In a terminal on ICC, go to ``/opt/MagAOX/source/MagAOX/apps/picamCtrl`` and run the script ``cleanPI.sh`` as root. This removes lock files.
4. Restart both control processes.
5. Power up both cameras

rtimv not updating
~~~~~~~~~~~~~~~~~~

If `rtimv` is not showing images:

#. If remote, check that you have opened the right tunnel. Usually to port 9000 on AOC.
#. Check that the camera is actually updating. You can use:

   .. code-block:: bash

      [icc/rtc]$ milk-shmimmon camname

   on the computer which hosts the camera. If not, try restarting the camera control software and then take camera specific troubleshooting steps.

#. Check that the `mzmqServer` process on the computer which hosts the camera is alive
#. Check that the `mzmqClient` process for the computer which hosts the camera is alive on AOC
#. Check that frames are arriving on AOC with:

   .. code-block:: bash

      aoc$ milk-shmimmon camname

#. If remote, check that `mzmqServerAOC` is alive on AOC
#. If you are using the `-Z` option with rtimv (milkzmq direct), try:

   a. Start a `milkzmqClient` instance:

      .. code-block:: bash

         $ milkzmqClient -p 9000 localhost camname #assumes you have the tunnel open

      it will tell you if it begins receiving images and you will see a file show up in `/milk/shm`.

   b. Next try using `milk-shmimmon camname` or `rtimv camname`.

Killing INDI zombies
~~~~~~~~~~~~~~~~~~~~

If the ``indiserver`` (itself a subprocess of `xindiserver `_) crashes uncleanly, the associated ``xindidriver`` processes may become orphans (i.e. reparented to PID 1 (init)). This will prevent `xindiserver `_ from starting again until these processes have been killed. (There will be output in logdump suggesting you ``kill the zombies``.)
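
To illustrate what "orphaned" means here, the following minimal sketch filters ``ps`` output for processes named ``xindidriver`` whose parent is PID 1. The function name ``list_orphaned_drivers`` is hypothetical and not part of MagAO-X. It only lists candidates and does not kill anything.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of MagAO-X): read "PID PPID COMMAND" rows on
# stdin and print the PID of every xindidriver whose parent is PID 1.
list_orphaned_drivers() {
    awk '$2 == 1 && $3 == "xindidriver" { print $1 }'
}

# Typical use on a live system (the "=" suffixes suppress the ps header line):
#   ps -eo pid=,ppid=,comm= | list_orphaned_drivers
```

An empty result suggests no orphaned drivers are left; otherwise the printed PIDs are the processes blocking a restart.
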

Often using `xctrl` to restart the INDI server will fail due to the zombies. You can restart the INDI server efficiently by going to the tmux session for it:

.. code-block:: bash

   $ tmux a -t isRTC

where you change `isRTC` to the correct process name, such as `isAOC` or `isICC`. You need to be `xsup`. If the INDI server is down you will see error messages followed by a prompt. The following script will kill the zombies:

.. code-block:: bash

   $ killIndiZombies

Now you can up-arrow to find the INDI server command. It should be:

.. code-block:: bash

   $ /opt/MagAOX/bin/xindiserver -n isRTC

Hit enter to start up the server. Once it is running you can exit the tmux session with `ctrl-b` then `d`. Now you can test that the INDI server is running with

.. code-block:: bash

   $ getINDI

which should return all the INDI properties on that machine.

Difficulties with NVIDIA proprietary drivers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1. When installing, ensure you have ``systemctl set-default multi-user.target`` and a display is connected **only** to the VGA header provided by the motherboard
2. If NVIDIA graphical output did work, and now doesn’t: Your kernel may have been updated, requiring a rebuild of the NVIDIA driver. Having ``dkms`` installed *should* prevent needing to do this, but an uninstall and reinstall over SSH will also remedy it.
3. Runfile installs can be uninstalled with ``/usr/local/cuda/bin/cuda-uninstaller``. This may leave a vestigial ``/usr/local/cudaXX.YY`` folder (where ``XX.YY`` is a version number) that can most likely be safely removed. (It's probably just some temporary files that the installer didn't create and is too polite to remove.)

Computer Fails to Boot
~~~~~~~~~~~~~~~~~~~~~~

There may be several reasons for this.

Examples with known fixes:
^^^^^^^^^^^^^^^^^^^^^^^^^^

- Startup screen frozen at "initializing" and Q-Code A9

  - This probably means that the BIOS has lost its setup, and is trying to use a GPU for video display
  - Shut down and fully power down.
  - If you have a new mobo CR2032 battery, replace it now
  - Remove GPUs (i.e. by disconnecting the PCIe expansion cable from the host card on the mobo).
  - Install the VGA cable on the mobo (see manual for location)

    - Alternatively, you may be able to plug a monitor into the GPU

  - Boot, and press the del key over and over again until you see "Enter Setup" in the lower right corner.
  - Follow the BIOS setup guide
  - Reboot (F10, save settings).
  - Now shut down, fully power down, and reinstall/reconnect all GPUs.
  - Reboot.

USB Device Communication Problems
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If USB controlled devices, such as filter wheels, focus stages, and rotation stages, have errors such as::

   ERRNO: -42001 [Unknown error -42001] >TTY: tcgetattr returned error

or::

   USB Device 0403:6001:A9EF0AMU not found in udev

or similar, try these things:

.. note::

   As of 2024A we are seeing occasional near-total scrambling of USB communications at LCO, probably due to grounding problems. If many or essentially all USB devices appear to be having problems, skip to step 3.

1. Power cycle the problem device.

   - Note that not all USB devices have power control. In this case skip to step 2.
   - Be sure to power cycle both main power and the USB power if necessary

2. If power cycling the device did not fix it (or it doesn't have power control), next restart the software controller. This may be necessary after power-cycling if the USB device was re-enumerated on the motherboard.

   - Use `xctrl restart xxxx` where xxxx is the name of the device
   - watch the logs to see if the device is "found in udev"

3. If the above steps do not work, the USB hub associated with the device may need to be reset.

   - The following devices are not on the main USB hub, but plugged directly into the computer

     - rhtweeter (RTC)
     - ttmpupil (RTC)
     - usbdu0 (RTC)
     - rhncpc (ICC)
     - temprack: lower and upper (ICC)
     - usbdu1 (ICC)

     For these devices you can try unplugging and replugging their USB cables directly on the motherboard

   - If the above direct connection devices are not fixed by re-plugging, the computer will have to be rebooted. Follow the procedure for doing so.
   - Most USB devices are connected to the main 16-port USB hub. This can be remotely power cycled to reboot it.

     - Power off `dcpwr` from the pdu using **pwrGUI**. Wait a couple seconds, and power it back on.
     - This will cause all of the USB devices to get new addresses/tty numbers, so the software will have to be restarted. It's probably easiest at this point to use `xctrl restart all` on ICC instead of restarting them one-by-one.
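
When a device is reported as not found in udev, it can help to see which USB serial devices udev does know about. The sketch below is one way to do that, under the assumption that the identifier in the error message (``0403:6001:A9EF0AMU``) is the USB vendor ID, product ID, and serial number. That interpretation, and the function name ``usb_id_from_props``, are assumptions and not taken from the MagAO-X source.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of MagAO-X): read udev properties (KEY=VALUE
# lines, as printed by "udevadm info --query=property") on stdin and print
# the device identity as vendor:product:serial.
usb_id_from_props() {
    awk -F= '
        $1 == "ID_VENDOR_ID"    { vendor = $2 }
        $1 == "ID_MODEL_ID"     { product = $2 }
        $1 == "ID_SERIAL_SHORT" { serial = $2 }
        END { print vendor ":" product ":" serial }
    '
}

# Typical use on a live system: print the identity of every USB serial port.
#   for dev in /dev/ttyUSB*; do
#       printf '%s  ' "$dev"
#       udevadm info --query=property --name="$dev" | usb_id_from_props
#   done
```

If the identifier from the log does not appear in this listing, the device is probably not enumerated at all, which points toward the power-cycle or hub-reset steps rather than a software restart.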