Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

problem with USB device reboot loop #970

Open
comsyspro opened this issue Jun 8, 2024 · 37 comments
Open

problem with USB device reboot loop #970

comsyspro opened this issue Jun 8, 2024 · 37 comments

Comments

@comsyspro
Copy link

comsyspro commented Jun 8, 2024

With usbipd v4.2.0 this device reboots when you stop giving input signals as seen in the video. You can see it here if you stop rotating the controller the device reboots. As long as you rotate a knob or operate other controllers the device stays on. If you do nothing, the device will reboot about 1 second later. The device should stay stable without rebooting. When I go back to usbipd v3.0.0 the device works fine (it stays stable without reboot loops) but usbipd v3.0.0 can lead to other troubles (BSOD) so that's no alternative. So I guess that something changed in v4.2.0 which causes this issue for this USB device. The device also works for Windows (Client) to Linux (Server). Only Windows (Client) to Windows (Server) has this issue. I also tried usbipd bind -b 1-4 --force without an effect.

usbip_boot_loop.mp4

screenshot1

Here are captures of usbipd server (start capture - attaching the device - 3 reboots - detaching - stop capture):
FILE.zip

Here are captures of USBIP client (start capture - attaching the device - rotating knob - stopped rotating and device reboots - again rotation knob pressing a button - stopped doing anything and the device reboots again - stop capture):
usbip-ude.txt
usbip-flt.txt

What is the problem and how can I get the USB device to work?

Update:
Comparing usbipd v3.0.0 with v3.1.0 leads to the assumption that AttachedClient.cs may be the cause:

// 2) The URB should be canceled (with a race condition of it already being completed, of course).

I guess that here the AttachedEndpoint class closes all endpoints and that leads to a reboot of the device.

@dorssel
Copy link
Owner

dorssel commented Jun 8, 2024

  1. Can you find the last version of usbipd-win that works, and the first that does not? (There are many changes between 3.0.0 and 4.2.0)
  2. Can you follow https://github.com/dorssel/usbipd-win/wiki/Troubleshooting, and send the detailed log of usbipd server Logging:LogLevel:Default=Trace

@dorssel
Copy link
Owner

dorssel commented Jun 8, 2024

From the PCap I see two (similar) occurrences:

  1. line 7774: The input pipe errors (EPIPE)
  2. line 7775: Linux (driver or software?) immediately asks for a new transfer on the same pipe
  3. line 7804: Very soon after, Linux unlinks (cancels) the IN request from (2)
  4. line 7805: As a result of the unlink request, usbipd cancels the request from (2)
  5. line 7806: usbipd reports the requested unlink was successful
  6. line 7807: Linux starts reconfiguring the device

Same happens on lines: 13140, 13141, 13170, 13171, 13172, 13173

The only thing weird here is the first line: the IN pipe errors with EPIPE. I don't know why what is. What I do know is that EPIPE is a translation of a VBoxUSB driver specific return code. Maybe it should be translated to something else.

@comsyspro
Copy link
Author

comsyspro commented Jun 8, 2024

What I can say is that on v2.4.1 it worked but here I can get a BSOD on the server side very soon. So the difference should be between v2.4.1 and v3.0.0. I also tried to use v4.2.0 and put the Dirvers from v2.4.1 in the build but that didn't solve the issue so I think it's not a driver related thing. As client I'm using https://github.com/vadimgrn/usbip-win2 latest version. In the usbip-ude.txt log file you can see the "endpoint_purge" and I think these are the 6 endpoints of the USB device. It seems when stop rotating the knob all 6 endpoints get purged and perhaps this leads to the reboot. Here are further investigations concerning the client code vadimgrn/usbip-win2#75 (comment).

Here are the logs:
1.) usbipd v4.2.0 - here the device was attached and started without doing anything on the device - the device reboots is running - after 2 seconds it reboots - then it's running again - after 2 seconds it reboots - then I detached it
FILE_1.zip

2.) usbipd v4.2.0 - here the device was attached and started with rotating a knob - the device is running while rotating the knob - stop rotating the knob after 2 seconds it reboots - then I detached it
FILE_2.zip

3.) usbbipd v2.4.1 - here the device was attached and started normally - device works normal without reboot - giving some inputs rotating a knob - then I detached it - as mentioned v2.4.1 can lead to BSOD on the usbipd server side.
FILE_3.zip

@dorssel
Copy link
Owner

dorssel commented Jun 9, 2024

I also need the logs from Logging:LogLevel:Default=Trace.

The "translation" from the endpoint error to EPIPE is done by usbipd-win. Maybe it should be a different error. I suspect it should be something like "timeout".

@comsyspro
Copy link
Author

I've done the logging with usbipd server Logging:LogLevel:Default=Trace "usbipd:PcapNg:Path=C:\log\FILE_1.pcapng". Should I change here something?

@dorssel
Copy link
Owner

dorssel commented Jun 9, 2024

I need the console output.

@comsyspro
Copy link
Author

comsyspro commented Jun 9, 2024

Here is the output copied from console (usbipd v4.2.0 with 3 reboots):
console_output_v4.2.0.txt.zip

And here is console output (usbipd v2.4.1 without reboots):
console_output_v2.4.1.txt.zip

@dorssel
Copy link
Owner

dorssel commented Jun 9, 2024

Nothing that I can find so far. Do you also have a "usbip-ude.txt" for 2.4.2, so I can maybe find the difference?

@comsyspro
Copy link
Author

comsyspro commented Jun 10, 2024

Thank you so far. Here are the logs with v2.4.1 (start capture - attaching the device - rotating a knob - detaching - stop capture):.
usbip-ude.zip
usbip-flt.zip

another idea / guess:
As you can see in the screenshot of the device the enpoint no. 3 "Application Sepcific Interface/Device Firmware Update" is probably used for firmware upgrades. Perhaps when purging this endpoint the device goes into reboot but this is just a guess. Is it possible somehow to only purge specific endpoints so you could say do only purge e.g. no. 1, 2, and 4. As long as rotating a knob nothing is purged and the device stays online. When stop rotating the knob all endpoints get purged and perhaps that leads to the reboot.

@dorssel
Copy link
Owner

dorssel commented Jun 10, 2024

No, not all endpoints are purge (despite what the usbip-ude files tells you). Only input endpoint 3, and only after it first errors (probably due to a inactivity/timeout). Which happens on both 2.4.1 and on 4.2.0.

The only difference is that on 2.4.1, the sequence was always:

  1. Linux issues IN-URB
  2. Linux request unlink of (1)
  3. usbipd-win cancels URB
  4. usbipd-win reports (1) as EPIPE
  5. usbipd-win reports unlink (2) as "no longer pending"

And on 4.2.0 it is improved to:

  1. Linux issues IN-URB
  2. Linux request unlink of (1)
  3. usbipd-win cancels URB
  4. This is actually a no-op: the canceled URB is not reported to Linux
  5. usbipd-win reports (2) as "OK, I've canceled it"

There is always a race condition when canceling/unlinking an URB: either the URB completes ("wins") before you could actually cancel it, or the cancelation "wins". The last option is actually the better one, and the first can still sporadically happen but causes a lot of logging in syslog. In 2.4.1, we always reported like the first option happened, which caused a lot of people to complain about excessive logging. In 4.2.0, option 2 is used (if we can), unless the race really was won by the URB.

The Pcaps for 2.4.1 and 4.2.0 are identical, except for the fact that 2.4.1 did not yet log UNLINK. This is the only difference.

@comsyspro
Copy link
Author

comsyspro commented Jun 10, 2024

So what can I try in the code to test v4.2.0 with the option used in v2.4.1. Or can I define a condition if this "deviceid:vendorid" is used then try to use option 1 and if not option 2. At the moment I can't understand exactly the two options. I would implement the code v4.2.0. if you could give a code snippet to test and recompile if that would work. And if it is a timeout issue can this be temporarily bridged in the code?

Perhaps here?

// UNLINK strategy

Or here?

// Upon UNLINK, Linux expects the URB to actually be canceled (besides not wanting to know the SUBMIT result,

@dorssel
Copy link
Owner

dorssel commented Jun 10, 2024

You can try replacing the following line:

var won = PendingSubmits.TryRemove(unlink.submit_seqnum, out var _);

with simply:

var won = false;

This will make UNLINK believe that it lost the race condition. It will also make every URB return its result, even if it was canceled (as it won't be removed from the queue). That will mimic old behavior. Not exactly, though, as the ordering of URBs and UNLINKs may be different.

Can you also post the syslog of:

  • 2.4.1
  • 4.2.0 (release)
  • 4.2.0 (after above change)

syslog reports UNLINKs quite well, so it could give some more hints as to what is happening. I still don't understand what is going wrong exactly.

@comsyspro
Copy link
Author

comsyspro commented Jun 11, 2024

I changed to var won = false; but that didn't fix the issue, it's same as before. Where can I find the syslog as I'm using Windows<->Windows and not Windows<->Linux.

Can it be that also here is the problem? Is it possible to control which endpoints are canceled?

// On Linux, this is handled alongside with (1), but the VBoxUSB driver cannot cancel individual

Perhaps it has also to do something with handling the endpoints and here it purges somehow an endpoint which reboots the device. So can you define somehow which endpoints could be influenced or not? Perhaps one endpoint is used for firmware updates and when this is triggered it goes into a boot mode.
https://github.com/dorssel/usbipd-win/blob/master/Usbipd/AttachedEndpoint.cs

On the other hand the device works on v2.4.1 for about 30 seconds then on the usbipd side Windows gets a BSOD und crashes.

@dorssel
Copy link
Owner

dorssel commented Jun 11, 2024

Is it possible to control which endpoints are canceled?

Only endpoint 3 is canceled. The comment is about canceling (unlinking) individual requests (URBs) on that endpoint. This is not possible with the VBoxUSB driver. But in your case there is only 1 outstanding URB on endpoint 3, so this does not apply.

Perhaps it has also to do something with handling the endpoints and here it purges somehow an endpoint which reboots the device.

No, this is not it. Only endpoint 3 is canceled, as it is requested by the client. I know that usbip-ude.txt mentions purging the other endpoints, but in practice no such calls are being made (they do not show up in the Pcaps).

Where can I find the syslog

Never mind. I was thinking about Linux. On Windows, you currently have all the logging that is available.

On the other hand the device works on v2.4.1 for about 30 seconds then on the usbipd side Windows gets a BSOD und crashes.

I still don't know what is the root cause of the problem. It starts with an EPIPE on the input endpoint after you stop providing input to the device. But this also shows up in the 2.4.1 logs, and nothing has changed there. Then it unlinks the next URB, which leads to an AbortPipe (also on both 2.4.1 and 4.2.0); that too has not changed. The only change between 2.4.1 and 4.2.0 is how the cancelation is reported back to the client, and you have tried both methods now.

For the time being, you could checkout the 2.4.1 source, and backport the BSOD fix: #602

@comsyspro
Copy link
Author

comsyspro commented Jun 11, 2024

I will try to do that. Also the developer of usbip-win mentioned that it could be something with power management and selective suspend:
vadimgrn/usbip-win2#75 (comment)

But the device normally doesn't go into a powersaving mode. As you can see in the video it goes directly after 1 second in reboot after stop doing somthing on the device.

You told that endpoint 3 is canceled and this endpoint is used as "Application Sepcific Interface/Device Firmware Update". I think that's the reason for the reboot. So how could you prevent from canceling endpoint no. 3?
image

Thanks for your investigation. I'm not giving up to finding out the reason yet. I want to know what it is.

@dorssel
Copy link
Owner

dorssel commented Jun 11, 2024

Endpoint 3 seems to be used for everything. It is also the one used to report the knob data. For some reason (unknown to me), the device STALLs the URB if there is no actual input. This is weird, but not necessarily a problem. Next (after the EPIPE), a new URB is requested, apparently to replace the previously EPIPEd input URB. This is also normal. Next, it gets unlinked. This is really weird, I don't know why the client does that. But it also does it using 2.4.1, so it shouldn't be a problem either.

You could just not cancel the URB upon unlink. But that goes completely against the specs. UNLINK should really complete or cancel the URB as soon as possible.

The cancelation happens here:

await Device.IoControlAsync(SUPUSB_IOCTL.USB_ABORT_ENDPOINT, StructToBytes(clearEndpoint), null);

You could skip that by replacing:

if (pendingSubmits == 0)

with:

if (true)

That simply makes UNLINK a no-op. But I doubt it is going to work correctly.

@comsyspro
Copy link
Author

comsyspro commented Jun 11, 2024

I've found something new. I checkout to v2.4.1 and replaced exactly these lines #602:
2aa1573?diff=split&w=1

I think I can narrow it down further:

// beginning at line 373, with this code (in else statement) the USB device goes into reboot
urb.buf = IntPtr.Zero;
StructToBytes(urb, bytes);
ioctl = Device.IoControlAsync(SUPUSB_IOCTL.SEND_URB, bytes, bytes);

With these lines I don't get a BSOD here, but the reboot issue begins. So it has to do with that code. When I remove these lines again and go back to v2.4.1 then the USB device works as it should but I get than the BSOD. So do you know how to solve this dilemma? And can now also something be changed in v4.2.0 with this?

@dorssel
Copy link
Owner

dorssel commented Jun 11, 2024

It is related, but not the problem.
From the first PCap of #970 (comment):

  • From the start, every BULK_OUT to endpoint 3 has some data, sometimes small (e.g. line 1656), sometimes big (e.g., line 1660).
  • Although there are fewer reads, every BULK_IN also gets some data; it starts with some big blobs (e.g. line 50 is the first). Then always small (e.g. line 96).

It looks like the device is supposed to work like this: there is always one outstanding read request. And in the meantime there are a few write requests. Always with data. The read requests all return within around 20 ms.

Then something goes wrong (reason unknown). The last successful read (with data) is line 1836. The next request is issued at line 1838. This gets completed at line 1882. However, there is no data (0 bytes read). It also took 32 ms to complete (longer than usual). The next read request gets scheduled immediately (line 1883). From this moment on, the writes are also 0-length, starting at line 1884. Things are now not working correctly. (NOTE: the BSOD fix actually fixes these 0-length writes). But I think the problem is that everything is now 0-length; there is no actual data communicated anymore. It ends with one of the read requests actually STALLING (that's a USB-term, the Linux error code is EPIPE), at line 1906. Another read is scheduled at line 1907, which never completes. The device driver asks for cancellation after 2 seconds by sending an UNLINK. Now the device reboots.

Maybe (just maybe) we can insert an extra STALL recovery right after the first stall is detected. Normally the device driver should do that, but for some reason it just asks for another read (while a stall has already occurred). That one of course never completes, and after 2 seconds the driver gives up, purges everything and reconfigures the device (which you see as a reboot).

I'll try to add something like "auto-stall-recovery" in usbipd-win. May have to wait until the weekend or so.

NOTE: once the stall occurs, all writes are 0-length, which (under circumstances/depending on bad luck) could previously lead to BSOD. The fix and your reboot are not related; they are just happening around the same time.

@comsyspro
Copy link
Author

comsyspro commented Jun 12, 2024

That would be great if this will work. I tested to modify something in the else statement after that line (in v2.4.1 the code is in AttachedClients.cs, in v4.2.0 the code is found here in AttachedEndpoint.cs) here

and (added some more try catch lines) and had the impression to make it a little bit more stable so the BSOD was coming only after 3 minutes or so.

@comsyspro
Copy link
Author

When do you think you could find the solution for this?

@dorssel
Copy link
Owner

dorssel commented Jun 18, 2024

I've looked into it, but if I add an automatic pipe reset, everything fails. It is not supposed to happen like that. The client (or the client hub) is supposed to do the pipe reset. This happens normally for all USB devices, every now and then (especially with poor cables): PIPE stalls, device reports EPIPE, client (or hub, I don't know) resets pipe.

For some reason (in your logs), instead of clearing the pipe after receiving an EPIPE stall indicator, another read request is made. This is not supposed to happen on a stalled pipe. The client (or the hub) should have reset the pipe first. What client are you using? You said something about a Windows client?

@comsyspro
Copy link
Author

comsyspro commented Jun 19, 2024

I'm using this USBIP client https://github.com/vadimgrn/usbip-win2. As said this USB device works normally with this USBIP client from Windows to Linux USBIP server. The developer told that he probably will begin to develop a USBIP server. But for now your usbipd is the only way for Windows. If you have any idea what to do please let me know to test it out. Here is also the driver code of USBIP client and eventually you can see if here is something which causes the Pipe stall / EPIPE https://github.com/vadimgrn/usbip-win2/tree/master/drivers/ude.

@dorssel
Copy link
Owner

dorssel commented Jun 24, 2024

What would really help is a "known good" capture when the device is physically connected to a native Linux host. You can use wireshark or tshark to capture a PcapNg file from the Linux USB monitoring device.

@comsyspro
Copy link
Author

comsyspro commented Jun 24, 2024

Here is a fresh usbip capture. Used USBIP server from Ubuntu 24.04 and following command sudo tcpdump -i any tcp port 3240 -w usbip_capture.pcapng and the USB device works here fine.

following steps were captured:

  • start capture
  • attaching the device
  • rotating a knob for ~3 seconds
  • pausing ~2 seconds
  • rotating another knob for ~3 seconds
  • pausing ~2 seconds
  • rotating a controller wheel for ~3 seconds
  • pausing ~2 seconds
  • rotating a controller wheel for ~3 seconds
  • pausing ~2 seconds
  • detaching the device
  • stop capture

usbip_capture.zip

I hope you can read it out and come to a result. Thanks a lot.

@dorssel
Copy link
Owner

dorssel commented Jun 24, 2024

Maybe the info is in there somewhere, but better capture the USB traffic itself (and not the USBIP network transport layer). See https://serverfault.com/questions/1104416/how-to-capture-usb-traffic-using-wireshark-in-linux-cli. You may need to load a module (modprobe xxx) for the usbmon device..

@comsyspro
Copy link
Author

Ok, here is the capture from the USB device directly connected to the linux system (no USBIP).

You have also some infos of the USB device done with lsusb -s -v 003:005 in lsusb.txt file.

following steps were captured:

  • start capture
  • start davinci resolve software which detects the device
  • rotating a knob for ~3 seconds
  • pausing ~2 seconds
  • rotating another knob for ~3 seconds
  • pausing ~2 seconds
  • rotating a controller wheel for ~3 seconds
  • pausing ~2 seconds
  • rotating a controller wheel for ~3 seconds
  • pausing ~2 seconds
  • stop davinci resolve software
  • stop capture

usb_capture.zip

@dorssel
Copy link
Owner

dorssel commented Jun 25, 2024

What is most striking, is that with usbipd-win, both the IN and OUT endpoints are endpoint 3.
Whereas on Linux, the IN endpoint is 3, but the OUT endpoint is 4!

I need to figure out why that is. No time now, but this is very interesting! Will get back to you.

@comsyspro
Copy link
Author

Can you please say if you can look again into the issue because I can test this device only for this week.

@dorssel
Copy link
Owner

dorssel commented Jul 3, 2024

I just compared the USB capture from usbipd-win (FILE_2.zip) and Linux (usb_capture.zip).

Did you do a firmware upgrade in-between? Because the device configuration is different (they are the same device, right?):

Windows:

image

(NOTE: bcdUSB: 0x0210, this is supposed to be the USB version)

Linux:

image

(NOTE: bcdUSB: 0x0200)

Also, the reported interfaces and endpoints are different, which explains my earlier comment about the pipe numbers.

Windows:

image

(NOTE: pipe IN and OUT are both 3 & additional interface 5.0)

Linux:

image

(NOTE: pipe IN 3 and OUT 4 & no interface 5.0)

We are comparing apples and oranges...

@comsyspro
Copy link
Author

comsyspro commented Jul 3, 2024

I can't remember which file was from which date. Yes there was a firmware update but the last send files should be on the same firmware. The device is the same. So what I can do is to make it more simpler for you?

The FILE_2.zip was from 08.06.2024 and the usb_capture.zip from 24.06.2024 and in between there was a firmware upgrade.

The last two send files usbip_capture.zip and usb_capture.zip are created on 24.06.2024 and had the same firmware, usbip_capture.zip captured the USBIP traffic and usb_capture.zip captured the local attached USB traffic.

@dorssel
Copy link
Owner

dorssel commented Jul 3, 2024

I am trying to compare the Linux native USB traffic (your last usb_capture.zip) with the one from Windows via usbipd-win.
I don't think you sent any captures from usbipd-win with the new firmware (they all predate the firmware update, right?)

Can you make another usbipd-win capture with the same device/firmware combo as the latest Linux native capture?

@comsyspro
Copy link
Author

Here are the USBIPD Logs. If you could achieve something today it would be great because the device is only available for today and I will probably only get hands on the device again in 1 or 2 month.

20240704_1_LOGS.zip

@dorssel
Copy link
Owner

dorssel commented Jul 4, 2024

These look similar to the previous Windows logs: 5 interfaces, interface 4 has both IN 3 and OUT 3 pipes. And EPIPE after effective communication has stopped (0 byte payloads). Unfortunately, I cannot compare them to the Linux captures (which shows a different endpoint configuration). It still looks like the captures are either from different devices, or from different firmware revisions; sorry.

@comsyspro
Copy link
Author

I will look at it. I just noticed that when connected locally to Linux the device asks for a firmware update and then it asks for firmware v2.0 update. If you connect it locally to Windows, then it also wants to update to v2.1. So I have to try to set the device to a fixed firmware and not change it ignoring the update questions. I'll then do all the logs again and post them here.

@comsyspro
Copy link
Author

comsyspro commented Jul 4, 2024

So, here are the new logs. Firtsly the firmware of the device was updated to version 2.2.1 on Windows. Than the capture (file and console) was done for USBIPD. Than with that actual firmware version 2.2.1 the device was connected locally on Ubuntu and captured the USB (with tcpdump of usbmon) and lastly it was captured the USBIP traffic (with tcpdump of tcp port 3240). I hope you can now compare the results.

20240704_logs.zip

@comsyspro
Copy link
Author

comsyspro commented Jul 4, 2024

I just did another capture of the locally connected USB device on Windows with wireshark USBPcap. It's added in the win folder.

Update:
20240704_2_logs.zip

@comsyspro
Copy link
Author

And could you make any progress with my last sent log files? What do you think about it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants