[FIXED] Apparent overloading when moving RGB rotary encoder

francoytx · September 16, 2020, 2:40pm

This is a long post, answering bits of many of your previous ones, sorry for this, but I was taking my time to test and write this.

I totally missed the last part of this message, sorry for that.

So with the 12 ms show() time this improved… it’s still interesting because from what I see in the videos the timing is very slow for this to come up as a problem…

I’m testing with Turn which has 96 (32 encoder switches + 64 buttons) digital inputs, sending (with Max4Live) groups of 6 notes every 40 ms. That results in 12 (6 notes ON + 6 notes off) * 1000/40 = 12*25 = 300 messages per second and I don’t see stuck LEDs

In your controller there is only 2 strips being updated, one for the encoders and one for the buttons. The digital strips are horizontal, and in groups of 16.
In this image you can see the LED indexes for your model:
seq616-led-index

I also tried updating 4 buttons every 500 ms and turning the encoders and can’t reproduce the glitch you have.
Again, I am using Max for Live instead of RtMIDI, but this shouldn’t be relevant…
Here is the max patch in case you have a way to test it: SeqTestV2.amxd (69.7 KB)

Not really, I know I left the original comments there, but the 12 in the line
uint32_t sampleRate = 12; //sample rate, determines how often TC5_Handler is called

Is not really 12 hz. Actually the timer interrupts every 350 uS. If you set that to 13, it jumps to 3.6 ms. Set it to 11 and it interrupts every 6.3 ms. It’s weird, I know, I left it working at 350us promising myself to have more control over it in the future. The time may have arrived

I’m using micros() to measure it, I’m sure there’s a better way, but what I get is pretty consistent, so I’m guessing it is accurate enough.

To measure this I used the following code:

void TC5_Handler (void) {
  SerialUSB.println(micros()-antMicrosLoop);
  // Call MIDI read function and if message arrived, the callbacks get called
  MIDI.read();

  MIDIHW.read();
  
  // END OF YOUR CODE
  TC5->COUNT16.INTFLAG.bit.MC0 = 1; //Writing a 1 to INTFLAG.bit.MC0 clears the interrupt so that it will run again
  antMicrosLoop = micros();
}

The complete timer ISR, when no messages arrive, takes 9 us. When a note arrives, the ISR takes 130-180 us to complete for each message.

The code to measure this is similar:

void TC5_Handler (void) {
  antMicrosLoop = micros();
  // Call MIDI read function and if message arrived, the callbacks get called
  MIDI.read();

  MIDIHW.read();
  
  // END OF YOUR CODE
  TC5->COUNT16.INTFLAG.bit.MC0 = 1; //Writing a 1 to INTFLAG.bit.MC0 clears the interrupt so that it will run again
  SerialUSB.println(micros()-antMicrosLoop);
}

We placed the MIDI reads in the ISR because it works A LOT better when there are dumps of incoming messages, i.e.: when ableton has a track select event with a script dumping all the track data.
The loop takes from 500 to 2500 us to complete, depending on how many components it has to poll, so if the MIDI reads are there, it can fail to receive all the messages.

I think the stuck LEDs problem is on the SAMD21 <-> SAMD11 communication, and not on the MIDI reception.
We tested the midi reception sending with RtMIDI a specific amount of messages/second and counting them on the controller. We got to the point of receiving all of them with no drops.

We’ll work on it until we get it right!

gc3 · September 16, 2020, 4:17pm

@francoytx: nobody should apologize to me for making long posts

(BTW, please never feel any pressure to respond in detail, or at all, on these theories–I don’t want to overwhelm you with message processing and stop your loop from doing actual work! I know you’re putting a TON of effort into debugging this ASAP)

Anyway, thanks for the great information! That’s very good to know, esp. the frequency difference on the TC5 (I made assumptions rather than measuring…)

I’ll look at the M4L patches and think about reproducibility. (Not sure whether 12ms improved things or not, BTW, I haven’t sat down and counted pixels yet).

Totally agree that MIDI reception doesn’t seem to be the problem–as I said to @mike elsewhere, I can flash ALL 96 buttons at a very high rate and it looks fine (as long as I’m not turning an encoder). My suspicion is that there are weird timing interactions between the two controllers (probably explaining both issues).

As to the stuck LED one, my hypothesis is that when a “heavy” ISR happens (meaning it’s processing several messages at once and getting into the ms range) at the wrong time, that’s when things get stuck. If that’s what’s happening, we would look first at timing-dependent behaviors related to SAMD21/SAMD11 communication.

Two things jump out at me (no time to play with them now):

the 1ms blocking while loop in feedback.SendFeedbackData:892, waiting for ACK (0xAA), could take unpredictably long if the interrupt hits in the middle of it;
the SAMD11 is sending SHOW_END approximately every LED_SHOW_TICKS (subject to its own interrupt), because that gets fired whenever timeToShow is true since it’s outside the conditional, even though SHOW_IN_PROGRESS is only sent when ledShow && (!receivingBank || !receivingLEDdata).

Could SHOW_END be stepping in front of the ACK so that the wait logic never sees it and has to retry too many times? Could it occasionally be filling the incoming serial buffer on the SAMD21 and blocking the ACK from getting saved (I think if the ring buffer fills then writes are just dropped)?

Also, I’m sure you know this, but isn’t micros() limited (see discussion here) to a fairly small range during interrupts, after which it returns a value mod that range? So if the midi ISR were consistently taking (say) ~2130 to 2180us, you might be seeing that as 130 to 180us in your measurement code (precise/consistent but not accurate). No idea whatsoever if this is actually happening, of course!

gc3 · September 16, 2020, 6:12pm

Quick experimental results:

Moving SHOW_END into the condition doesn’t seem to change anything (and may create other problems).
However, putting MIDI.read() in the loop (and commenting it out in the ISR) seems, on a first run, to significantly improve (but NOT eliminate) the stuck pixels. I had to wait about 2 minutes to see the first one; usually I see one within 20-30 seconds.

I’m going to block out 20 minutes and actually do repeated tests later. Just wanted to give you initial impressions.

gc3 · September 16, 2020, 7:17pm

And a last note for now: I got .println() debugging working in Arduino IDE (level up!) and can more or less confirm that MIDI notes are reaching the callback functions in the correct order and groups even when stuck pixels happen.

francoytx · September 16, 2020, 7:49pm

well I think you actually read the incorrect comments on the code, so your assumptions were based on something present in the code haha

I saw this and already fixed it. It now sends “SHOW_END” only if show() actually begins.

I know millis() inside an ISR is not available, but I think micros() is.
The discussion you link to is about AVR which have 8/16 bit registers and might overflow. I think this is less of a problem in SAMD arch and micros() is quite reliable. A different way to test this can be with an oscilloscope setting a pin HIGH entering the ISR and LOW when leaving. Can be very useful with stable ISR durations, but might be hard to track when a big difference in timing occurs. Anyhow, like we’ve been saying, the problem seems to be somewhere else.

Great! Yeah, like i said before, MIDI reception is pretty solid. I’d aim at the buffering, and the SAMD21 <-> SAMD11 communication.

This affects the MIDI reception, if you try to populate the 96 buttons like you said in mike’s post, you’ll probably get a poorer performance with the MIDI reads in the loop.

Thanks for these reports!

francoytx · September 16, 2020, 9:47pm

I did a 15 minute watch test sending notes to light up 4 LEDs at a time every 500 ms and I got 9 pixels that didn’t turn ON and 2 pixels that didn’t turn OFF.

In my case I am almost certain it was always the first of the 4 LEDs that didn’t turn on. The ones that didn’t turn OFF were not the same position
There’s a starting point.

gc3 · September 18, 2020, 5:41am

Great! Thanks. Glad you can reproduce. If you up the pixel/push count you can probably get it to stick reliably within the first 30s (less boring for debugging). If we don’t get this figured out soon I’ll start some data gathering about what affects rates, maybe with a webcam and some simple machine vision to do the counting. But I think if the right diagnostic output is on when a pixel sticks, cause will be pretty easy to figure out.

Will pick up other points later, & thanks for the clarification on micros(); if it’s not obvious, I know VERY little about actual hardware and am just analogizing from distributed systems programming

gc3 · September 19, 2020, 2:20am

I have log output which seems to differentiate failed updates from actual updates. (This is just a tiny snippet of the log but it’s the part that seems relevant).

This one (a NOTE_ON) successfully updated:

32842882 Calling FillFrameWithDigitalData 232
32843078 sendSerialBufferDec set 4 71 0 1 0 230 8 14 255
32844137 Calling SendFeedbackData from SendDataIfReady
32844411 ...waiting for ACK on try 0
32844583 ...received 170 //ACK

The very next call sequence (processing a NOTE_OFF) failed to update, resulting in a stuck pixel:

32844829 Calling FillFrameWithDigitalData 233
32845037 sendSerialBufferDec set 4 70 0 0 0 0 0 0 255
32846076 Calling SendFeedbackData from SendDataIfReady
32846344 ...waiting for ACK on try 0
32846485 ...received 249 // SHOW_IN_PROGRESS
32846760 ...waiting for ACK on try 1
32848008 ...waiting for ACK on try 2
32849229 ...waiting for ACK on try 3
32850495 ...waiting for ACK on try 4
32851733 ...waiting for ACK on try 5
32852957 ...waiting for ACK on try 6
32853417 ...received 170 // ACK

I’ve observed two stuck pixels so far, with exactly the same structure (try 0 gets 0xF9, timeouts until try 6 which gets 0xAA).

Other strangeness is present in the log but I won’t be able to look at it until tonight.

EDIT: Yep, this is quite consistently associated with stuck pixels. All the extra .print() and .println() calls seem to massively increase the incidence (which is independently interesting).

I just got eight stuck pixels (7 of them in the last position, one in the first) in one 16-step march with a 500ms delay. Every one of them got SHOW_IN_PROGRESS (249, 0xF9) back from try 0 instead of ACK. They took a variable number of retries to get 170: 2 more (n=1), 5 more (n=1), 6 more (n=3), 7 more (n=2), and 10 more (n=1, with 248 (CHECKSUM_ERROR) on retry #8 and 250 (SHOW_END) on retry #9). So 5-7 retries before ACK is typical but the distribution extends in both directions.

Additionally, one other update got 250 (SHOW_END) back from try 0, but 249 on try 1, and that one seems to have updated (not 100% sure).

francoytx · September 19, 2020, 5:18am

Hi @gc3!
Yes! I did a similar test.

It’d seem that from time to time the SAMD21 fails to realize that the SAMD11 is showing the neo pixels… so it tries a couple of times to send the same message

Until it gets an ACK and then follows on.

I sitll can’t seem to relate this to a stuck pixel every time.

I improved a bunch of stuff that was a bit off in the SAMD11 code, I will update the repo in this branch tomorrow.

Seems to be an improvement… I got 5-6 minutes without mistakes a couple of times.

There’s something I tested, which is making the SAMD21 signal the SAMD11 it can start to show pixels, but for now haven’t got it to work better than the current strategy, which is show regularly if there are changes to be made.

Still work to do! But I’m confident we’re narrowing it down!

francoytx · September 22, 2020, 12:34am

Hi @gc3!

Let’s try with this set:
ytx-v2-firmware-aux-0-14-testing.bin (11.1 KB)
ytx-v2-firmware-main-0-14-testing.bin (74.5 KB)

I came up with something of a way to check how good the MAIN <-> AUX comms are working and it is by just counting how many failed events happen when MAIN tries to send a message.
It is printing this every time it happens.
It will print total accumulated number of fails, then if it failed because of timeout you will see the timeout microseconds, and then “3,X” or “4,X” where 3 or 4 are encoder switch or digital and X is the index that failed, and then you will see the read index and write index of the feedback buffer.
These prints are on lines 890-903 of feedback.ino in branch bugFixFeedbackUpdate.

I tried 2 times for 5 minutes sending 4 notes ON and 4 notes OFF every 100 milliseconds and got no failed events.

I can’t see any more stuck or no-show pixels.

Please test them when you can
If you still see stuck pixels, we can try lowering the Serial baud of the main <-> aux link.

gc3 · September 22, 2020, 4:04am

Heyo! Awesome! Thank you so much for pouring your time into this. It speaks REALLY well of you and your company. You’ve gotten at least one lifetime customer out of this beta

Initial test was encouraging (no stuck pixels after 2min, new record, with 6 every 125ms). Will test more thoroughly.

Not at all to diminish your apparent heroic victory over stuck pixels but I can’t tell if the other problem (timing breakdown on encoder turn) is better or not–probably a little, but it’s still there. I’ve been pretty sure from my code review and tests that they’re not that related. Will report back on that too–trying some things to smoke it out.

francoytx · September 22, 2020, 2:28pm

One problem at a time!

I think the stuck pixel problem is coming to an apparent solution, I’ll start working on the encoder performance issue soon

gc3 · September 23, 2020, 7:36am

One problem at a time–yes, absolutely!!! I was just hoping that maybe they were the same problem after all… But, for what it’s worth, the lag on encoder turn does seem subjectively better than v0.13 (it’s hard to measure).

I have done a few more tests and seen nothing sticking. I’ve been distracted from tests, though, because I’m now up and coding on my first application and quite confident that it will work.

You’re amazing!

Will report back when I do some more tests.

francoytx · September 23, 2020, 12:02pm

It was actually a teamwork @gc3 Your testing and input on the code was quite helpful

I’m glad you’re not seeing any more stuck pixels! That’s very good news

Still work to do, but it may be very close to be fixed

gc3 · September 30, 2020, 2:41am

Quick update after a lot of prototype app development on my end (on which more soon!):

I haven’t seen any stuck pixels at low/medium rates with v0.14. When I really crank the BPM I still see them occasionally, but that may be pushing the limits of the MIDI spec anyway, and it’s certainly usable.

The encoder-turn update lag is definitely still there although it’s subjectively maybe 30% better than v0.12–unless I’m fooling myself. I’m not working on encoder-turn-heavy parts of the prototype yet so I’m not actively troubleshooting it but I will when the time comes. @francotyx, when you start working on that (if you’re not already) let me know if you’re not able to replicate based on the above description and the code/video and I’ll work with you on that.

And thanks again! I’m already having fun with the proto on this hardware. It’s so incredibly well-made and well-thought-out, physically and software-wise, and I can’t wait to see where it goes!

francoytx · October 2, 2020, 11:59pm

Hi @gc3!

I think I have a clue as to why the encoders might be interfering with the feedback timing.

Let’s try something, and if we see a meaningful difference, I’ll explain why this is happening.

In the header “Defines.h” change the tag MAX_WAIT_MORE_DATA_MS to 5 instead of 10. Maybe the name is different in a previous code version, but if you use the current develop branch the name will be this.

Let’s see if there is a difference with this.
It certainly gets a lot worse if I increase it, you can try that as well.

It is not a permanent solution, but it may help until I figure out how to solve it.

gc3 · October 3, 2020, 12:07am

Awesome! Good thought. I think I see the theory (at least in broad strokes) and will give it a shot ASAP (might not be until tomorrow). Many thanks for your perseverance on this!

gc3 · October 8, 2020, 5:23am

Sorry! Got slammed these last few days. Should be able to test this tomorrow.

francoytx · October 8, 2020, 2:59pm

No worries.

I’ve been working on it and made a few improvements.
The feedback Update function will only now wait for more data only if the entries to the buffer are external feedback, and not when setting feedback internally like when the encoders move or the buttons are pressed and set to Local.

Also, I changed a bit the function that sends the feedback frame to the SAMD11.
Now it won’t read the Serial port directly, but just wait until the flag is taken down by the Serial interrupt routine.
This way the Serial incoming data from the SAMD11 is only read in one place and there are no more non-ACK bytes producing errors.
For this, I had to comment some lines in variants.cpp which is a file in the hardware folder, so when the time comes to update, if you want to compile, you’ll need to update from the repo to your computer.
This will be specified in the release notes.

With these changes, I set the wait time (just testing it) to 50 ms and I could not notice any lag in the button’s sequence.

Let’s see how it goes for you!

gc3 · October 8, 2020, 4:53pm

OK, on a preliminary test this works like absolute magic, @francoytx, you’re a genius. I see no problems whatsoever with any of my existing tests.

I’ll keep poking it and let you know if I notice anything, but I think the FW is now at the point where I can begin full development.

I can’t thank you enough!