[FIXED] Apparent overloading when moving RGB rotary encoder

gc3 · September 18, 2020, 8:39pm

I have log output which seems to differentiate failed updates from actual updates. (This is just a tiny snippet of the log but it’s the part that seems relevant).

This one (a NOTE_ON) successfully updated:

32842882 Calling FillFrameWithDigitalData 232
32843078 sendSerialBufferDec set 4 71 0 1 0 230 8 14 255
32844137 Calling SendFeedbackData from SendDataIfReady
32844411 ...waiting for ACK on try 0
32844583 ...received 170 //ACK

The very next call sequence (processing a NOTE_OFF) failed to update, resulting in a stuck pixel:

32844829 Calling FillFrameWithDigitalData 233
32845037 sendSerialBufferDec set 4 70 0 0 0 0 0 0 255
32846076 Calling SendFeedbackData from SendDataIfReady
32846344 ...waiting for ACK on try 0
32846485 ...received 249 // SHOW_IN_PROGRESS
32846760 ...waiting for ACK on try 1
32848008 ...waiting for ACK on try 2
32849229 ...waiting for ACK on try 3
32850495 ...waiting for ACK on try 4
32851733 ...waiting for ACK on try 5
32852957 ...waiting for ACK on try 6
32853417 ...received 170 // ACK

I’ve observed two stuck pixels so far, with exactly the same structure (try 0 gets 0xF9, timeouts until try 6 which gets 0xAA).

Other strangeness is present in the log but I won’t be able to look at it until tonight.

EDIT: Yep, this is quite consistently associated with stuck pixels. All the extra .print() and .println() calls seem to massively increase the incidence (which is independently interesting).

I just got eight stuck pixels (7 of them in the last position, one in the first) in one 16-step march with a 500ms delay. Every one of them got SHOW_IN_PROGRESS (249, 0xF9) back from try 0 instead of ACK. They took a variable number of retries to get 170: 2 more (n=1), 5 more (n=1), 6 more (n=3), 7 more (n=2), and 10 more (n=1, with 248 (CHECKSUM_ERROR) on retry #8 and 250 (SHOW_END) on retry #9). So 5-7 retries before ACK is typical but the distribution extends in both directions.

Additionally, one other update got 250 (SHOW_END) back from try 0, but 249 on try 1, and that one seems to have updated (not 100% sure).
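
The pattern in these logs can be tallied mechanically. Below is a minimal host-side C++ sketch, not firmware code: the status byte values come from the log above, while `NO_RESPONSE`, `UpdateSummary`, and `summarize` are hypothetical names introduced only for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Status bytes as named in the log above.
enum AuxStatus : uint8_t {
  ACK              = 170,  // 0xAA
  CHECKSUM_ERROR   = 248,  // 0xF8
  SHOW_IN_PROGRESS = 249,  // 0xF9
  SHOW_END         = 250,  // 0xFA
};

// Hypothetical sentinel: no byte arrived before that try timed out.
constexpr int NO_RESPONSE = -1;

struct UpdateSummary {
  bool firstTryBusy;  // try 0 answered SHOW_IN_PROGRESS instead of ACK
  int  extraTries;    // tries needed after try 0 before ACK; -1 if ACK never came
};

// Takes one entry per try (try 0 first) and summarizes the exchange.
UpdateSummary summarize(const std::vector<int>& responsePerTry) {
  UpdateSummary s{false, -1};
  if (responsePerTry.empty()) return s;
  s.firstTryBusy = (responsePerTry[0] == SHOW_IN_PROGRESS);
  for (std::size_t i = 0; i < responsePerTry.size(); ++i) {
    if (responsePerTry[i] == ACK) {
      s.extraTries = static_cast<int>(i);
      break;
    }
  }
  return s;
}
```

Fed the NOTE_OFF exchange above (249 on try 0, nothing on tries 1 to 5, 170 on try 6), this reports a busy first try and 6 extra tries, which matches the stuck-pixel signature described in the post.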

francoytx · September 19, 2020, 5:13am

Hi @gc3!
Yes! I did a similar test.

It seems that from time to time the SAMD21 fails to realize that the SAMD11 is showing the NeoPixels, so it tries a few times to send the same message until it gets an ACK, and then carries on.

I still can’t relate this to a stuck pixel every time.

I improved a bunch of stuff that was a bit off in the SAMD11 code, I will update the repo in this branch tomorrow.

Seems to be an improvement… I got 5-6 minutes without mistakes a couple of times.

I also tested having the SAMD21 signal the SAMD11 when it can start showing pixels, but so far I haven’t got that to work better than the current strategy, which is to show at regular intervals whenever there are changes pending.

Still work to do! But I’m confident we’re narrowing it down!

francoytx · September 22, 2020, 12:28am

Hi @gc3!

Let’s try with this set:
ytx-v2-firmware-aux-0-14-testing.bin (11.1 KB)
ytx-v2-firmware-main-0-14-testing.bin (74.5 KB)

I came up with a way to check how well the MAIN <-> AUX comms are working: counting how many failed events happen when MAIN tries to send a message.
The firmware prints a line every time one happens.
Each line shows the total accumulated number of fails; then, if it failed because of a timeout, the timeout in microseconds; then “3,X” or “4,X”, where 3 or 4 means encoder switch or digital and X is the index that failed; and finally the read index and write index of the feedback buffer.
These prints are on lines 890-903 of feedback.ino in branch bugFixFeedbackUpdate.
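
As a rough illustration of the fields listed above, here is a host-side C++ sketch of a failure counter. It is an assumption-laden stand-in, not the code from feedback.ino: `FailEvent`, `FailCounter`, and the exact text layout of the line are hypothetical, and only the set of fields comes from the post.

```cpp
#include <cstdint>
#include <string>

// One failed MAIN -> AUX send. Field names are hypothetical.
struct FailEvent {
  bool     timedOut;       // true if the failure was a timeout
  uint32_t timeoutMicros;  // only meaningful when timedOut is true
  uint8_t  type;           // 3 = encoder switch, 4 = digital (per the post)
  uint8_t  index;          // index of the element that failed
  uint8_t  readIdx;        // feedback buffer read index
  uint8_t  writeIdx;       // feedback buffer write index
};

struct FailCounter {
  uint32_t total = 0;  // accumulated number of fails

  // Counts the event and returns a report line (layout is illustrative only).
  std::string record(const FailEvent& e) {
    ++total;
    std::string line = std::to_string(total);
    if (e.timedOut) {
      line += " timeout " + std::to_string(e.timeoutMicros) + "us";
    }
    line += " " + std::to_string(e.type) + "," + std::to_string(e.index);
    line += " r=" + std::to_string(e.readIdx) + " w=" + std::to_string(e.writeIdx);
    return line;
  }
};
```

A steadily growing total during a test run would point at the MAIN <-> AUX link, while a total that stays at zero matches the "no failed events" result reported below.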

I tried 2 times for 5 minutes sending 4 notes ON and 4 notes OFF every 100 milliseconds and got no failed events.

I can’t see any more stuck or no-show pixels.

Please test them when you can.
If you still see stuck pixels, we can try lowering the Serial baud rate of the MAIN <-> AUX link.

gc3 · September 22, 2020, 3:49am

Heyo! Awesome! Thank you so much for pouring your time into this. It speaks REALLY well of you and your company. You’ve gotten at least one lifetime customer out of this beta

Initial test was encouraging (no stuck pixels after 2min, new record, with 6 every 125ms). Will test more thoroughly.

Not at all to diminish your apparent heroic victory over stuck pixels but I can’t tell if the other problem (timing breakdown on encoder turn) is better or not–probably a little, but it’s still there. I’ve been pretty sure from my code review and tests that they’re not that related. Will report back on that too–trying some things to smoke it out.

francoytx · September 22, 2020, 2:28pm

One problem at a time!

I think the stuck pixel problem is close to being solved. I’ll start working on the encoder performance issue soon.

gc3 · September 23, 2020, 7:36am

One problem at a time–yes, absolutely!!! I was just hoping that maybe they were the same problem after all… But, for what it’s worth, the lag on encoder turn does seem subjectively better than v0.13 (it’s hard to measure).

I have done a few more tests and seen nothing sticking. I’ve been distracted from tests, though, because I’m now up and coding on my first application and quite confident that it will work.

You’re amazing!

Will report back when I do some more tests.

francoytx · September 23, 2020, 12:02pm

It was actually teamwork, @gc3. Your testing and input on the code were quite helpful.

I’m glad you’re not seeing any more stuck pixels! That’s very good news.

Still work to do, but it may be very close to being fixed.

gc3 · September 30, 2020, 2:41am

Quick update after a lot of prototype app development on my end (on which more soon!):

I haven’t seen any stuck pixels at low/medium rates with v0.14. When I really crank the BPM I still see them occasionally, but that may be pushing the limits of the MIDI spec anyway, and it’s certainly usable.

The encoder-turn update lag is definitely still there although it’s subjectively maybe 30% better than v0.12–unless I’m fooling myself. I’m not working on encoder-turn-heavy parts of the prototype yet so I’m not actively troubleshooting it but I will when the time comes. @francoytx, when you start working on that (if you’re not already) let me know if you’re not able to replicate based on the above description and the code/video and I’ll work with you on that.

And thanks again! I’m already having fun with the proto on this hardware. It’s so incredibly well-made and well-thought-out, physically and software-wise, and I can’t wait to see where it goes!

francoytx · October 2, 2020, 11:53pm

Hi @gc3!

I think I have a clue as to why the encoders might be interfering with the feedback timing.

Let’s try something, and if we see a meaningful difference, I’ll explain why this is happening.

In the header “Defines.h”, change the define MAX_WAIT_MORE_DATA_MS from 10 to 5. The name may differ in earlier code versions, but in the current develop branch this is the name.

Let’s see if there is a difference with this.
It certainly gets a lot worse if I increase it; you can try that as well.

It is not a permanent solution, but it may help until I figure out how to solve it.
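
For reference, the suggested edit would look roughly like this. The surrounding contents of Defines.h are not shown in this thread, so treat the line as a sketch of the change rather than a copy of the file.

```cpp
// Defines.h (current develop branch, per the post above)
// Temporary test value; the previous value was 10.
#define MAX_WAIT_MORE_DATA_MS 5
```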

gc3 · October 3, 2020, 12:07am

Awesome! Good thought. I think I see the theory (at least in broad strokes) and will give it a shot ASAP (might not be until tomorrow). Many thanks for your perseverance on this!

gc3 · October 8, 2020, 5:23am

Sorry! Got slammed these last few days. Should be able to test this tomorrow.

francoytx · October 8, 2020, 2:59pm

No worries.

I’ve been working on it and made a few improvements.
The feedback Update function now waits for more data only if the entries in the buffer are external feedback, and not when feedback is set internally, such as when the encoders move or the buttons are pressed and set to Local.
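
The gating rule just described can be sketched in a few lines of host-side C++. This is only an illustration under assumptions: `FeedbackSource` and `shouldWaitForMoreData` are hypothetical names, and the real Update function is not shown in this thread.

```cpp
#include <cstdint>

// Hypothetical tag for where a feedback entry came from.
enum class FeedbackSource { External, Local };

// Returns true while the update routine should keep collecting entries
// before sending a frame. maxWaitMs plays the role of MAX_WAIT_MORE_DATA_MS.
bool shouldWaitForMoreData(FeedbackSource src,
                           uint32_t msSinceLastEntry,
                           uint32_t maxWaitMs) {
  // Internally generated feedback (encoder moves, Local buttons) is never held back.
  if (src != FeedbackSource::External) return false;
  // External feedback may be batched until the wait window has elapsed.
  return msSinceLastEntry < maxWaitMs;
}
```

Under this rule the size of the wait window no longer affects locally generated feedback, since that path never waits at all.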

Also, I slightly changed the function that sends the feedback frame to the SAMD11.
It no longer reads the Serial port directly; it just waits until the flag is cleared by the Serial interrupt routine.
This way the Serial incoming data from the SAMD11 is only read in one place and there are no more non-ACK bytes producing errors.
For this, I had to comment out some lines in variants.cpp, which is a file in the hardware folder, so when the time comes to update, if you want to compile, you’ll need to update that file from the repo to your computer.
This will be specified in the release notes.
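
The flag-based wait described above might look something like the following host-side C++ sketch. It is not the firmware code: `waitingForAck`, `onSerialByte`, and `waitForAck` are hypothetical names, `std::atomic<bool>` stands in for what would likely be a `volatile` flag on the microcontroller, and the `tick` callback is an assumption that simulates time passing while the interrupt routine runs.

```cpp
#include <atomic>
#include <cstdint>

constexpr uint8_t ACK = 0xAA;  // 170, as seen in the logs earlier in the thread

// Raised by the sender before transmitting; cleared only by the receive handler.
std::atomic<bool> waitingForAck{false};

// Stand-in for the Serial interrupt routine: the single place incoming bytes are read.
void onSerialByte(uint8_t b) {
  if (b == ACK) waitingForAck.store(false);
  // Other status bytes would also be handled here, never by the sender.
}

// Sender side: polls the flag and never touches the serial port itself.
// Returns true if the flag was cleared within maxPolls checks.
template <typename Tick>
bool waitForAck(int maxPolls, Tick tick) {
  for (int i = 0; i < maxPolls; ++i) {
    if (!waitingForAck.load()) return true;
    tick();  // on hardware, the interrupt would fire on its own here
  }
  return !waitingForAck.load();
}
```

Because only `onSerialByte` consumes incoming bytes, a stray status byte such as 0xF9 cannot be mistaken for the reply to the current frame by the sending function.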

With these changes, I set the wait time (just testing it) to 50 ms and I could not notice any lag in the button’s sequence.

Let’s see how it goes for you!

gc3 · October 8, 2020, 4:53pm

OK, on a preliminary test this works like absolute magic, @francoytx, you’re a genius. I see no problems whatsoever with any of my existing tests.

I’ll keep poking it and let you know if I notice anything, but I think the FW is now at the point where I can begin full development.

I can’t thank you enough!