[FIXED] Apparent overloading when moving RGB rotary encoder

ah, many thanks! I haven’t figured out how to compile the aux firmware yet so this is very helpful.

I’ll reset to factory 0.13 main and then load this.

I didn’t post about it, but if you get Atmel Studio, just loading the SAMD11 project should work straightforward.

Compile for ATSAMD11D14AM. It’s included in the project settings though.

You hit “Build” and the binary file shows up in the ./Debug folder

I went ahead and grabbed it, and it did pick up the project settings, but I got some initial errors and didn’t dig further (figured I’d learn one firmware at a time :slight_smile: ). It’s probably just library linking stuff.

Trying new firmware now!

…Interesting. The encoder timing problems feel just the same as with factory firmware (I don’t have an objective measure of them). They’re definitely still there.

However, good news: I did get a few minutes without a stuck pixel in the MARCH test. Could be luck (it’s a stochastic problem) but I don’t think I’ve run the test that long without seeing one before. That test I can measure objectively (# of stuck pixels observed for N transits) and I’ll do that, for a low N, in a bit.

Do you remember if the strips are arranged in rows or columns on my build? If I understand the one-at-a-time limit, that might explain it, as that problem shows up when I’m blanking one column and setting the next in the same set of messages.

I’ll send video now (to your support email?) so that you won’t have to keep debugging based on my vague descriptions!

The samd11 doesn’t rely on libraries, just the ASF (Atmel Software Framework) which gets installed along with Atmel Studio. If you try again, post the errors and we can try to figure it out.

1 Like

I’ve had only a moment to look at this today, but here’s what seems like a key observation (regarding the lag problem, not the “stuck pixel” problem, which may or may not be connected).

Basically, it doesn’t look like the encoder ring feedback is what’s fighting the digital updates. I’ll try to keep narrowing in on the problem (assuming it’s single-caused).

This is on my minimal main firmware build, which simplifies the loop (as above) and implements the priority system. I’m using the updated version of your aux firmware from yesterday.

In addition to that, I commented out FillFrameWithEncoderData(fbUpdateQueueIndex); and SendDataIfReady(); under the FB_* cases. So only the digital buttons are being updated. The encoders are still sending commands, though (confirmed in MidiOx); they’re just not getting the ring/button feedback.

Twisting the encoders still interferes with the “march” very consistently. Sometimes it freezes for three or four beats and then jumps ahead. if anything, it seems worse (?!) than when the encoder feedback update was on. As before, it’s nice and regular timing-wise when the encoders aren’t tuning, although the stuck pixels still happen on their own. (I can send video later if this description’s not enough).

I got the SAMD11 build working after a moment’s look at the error messages.

  1. In the subcompilation, NeoPixels.c wasn’t finding NeoPixels.h, so I changed #include <NeoPixels.h> to #include "NeoPixels.h";
  2. Then, in the main compilation, main.c wasn’t finding NeoPixels.h, so I changed #include <NeoPixels.h> to #include "NeoPixel/NeoPixels.h" (and #include <ytxHeader.h> to #include "ytxHeader.h", pre-emptively). asf.h didn’t require a change.

I don’t know how exactly Atmel Studio tweaks the Visual Studio base but it looks as though src/NeoPixel wasn’t getting added to the project path (which is very weird as I see it in the .cproj file under <armgcc.compiler.directories.IncludePaths>). Anyway, easy fix, and now I can poke at the aux code too. Thanks for your help!

OK, here’s one theory (at least about the stuck pixels).

timer.TC5_Handler is getting called every 83.33ms (12Hz) as an interrupt service routine (ISR) to handle incoming MIDI and MIDIHW.

I’m a software person, not a hardware person, but my understanding of ISRs is that they’re supposed to do very little (like enqueue a value and then return immediately). If I’m reading the code right, TC5_Handler:

  • calls MIDI.read (and MIDIHW.read, but I’ll just focus on the USB side here), which:
    • Parses the entire serial buffer, yielding up to as many messages as can fit in it;
    • Potentially calls a callback for each method (as registered in setup), each of which:
      • Does some parsing of its own, and then calls ProcessMIDI, which:
        • Does some more message preparation, and then calls UpdateMIDIBuffer for each feedback type, which:
          • does a bunch more stuff, including a binary search for affected encoders.

I haven’t benchmarked it, and of course I may be misreading/misunderstanding, but if that’s all happening during the ISR, it seems like a lot! I wonder if the stuck pixels come from an occasional timing clash between that process and either the feedback or new MIDI reads.

I don’t know the reason that you put MIDI reads on an interrupt (and not in the loop), but if it needs to stay there, have you considered using the interrupt to do nothing but read raw MIDI messages into a fixed-length FIFO queue in memory (and then immediately return), and then consuming that queue (parsing the buffer into MIDI messages, dispatching to callbacks, doing your internal parsing and prep, then calling your own UpdateMIDIBuffer to fill the “cooked” buffer) in the main loop?

If you need that process to happen less (or more) frequently than the other items in the loop you could only call it (or call the other functions) mod N while maintaining a loop counter.

Anyway, I may be completely off-base, and I’m not sure how this would be affecting the encoder timing issue, but If I have time tomorrow I might pull MIDI.read() out of the interrupt and into the loop and see if the stuck pixels still appear occasionally (or if something else blows up!)

This is a long post, answering bits of many of your previous ones, sorry for this, but I was taking my time to test and write this.

I totally missed the last part of this message, sorry for that.

So with the 12 ms show() time this improved… it’s still interesting because from what I see in the videos the timing is very slow for this to come up as a problem…

I’m testing with Turn which has 96 (32 encoder switches + 64 buttons) digital inputs, sending (with Max4Live) groups of 6 notes every 40 ms. That results in 12 (6 notes ON + 6 notes off) * 1000/40 = 12*25 = 300 messages per second and I don’t see stuck LEDs :confused:

In your controller there is only 2 strips being updated, one for the encoders and one for the buttons. The digital strips are horizontal, and in groups of 16.
In this image you can see the LED indexes for your model:
seq616-led-index

I also tried updating 4 buttons every 500 ms and turning the encoders and can’t reproduce the glitch you have.
Again, I am using Max for Live instead of RtMIDI, but this shouldn’t be relevant…
Here is the max patch in case you have a way to test it: SeqTestV2.amxd (69.7 KB)

Not really, I know I left the original comments there, but the 12 in the line
uint32_t sampleRate = 12; //sample rate, determines how often TC5_Handler is called

Is not really 12 hz. Actually the timer interrupts every 350 uS. If you set that to 13, it jumps to 3.6 ms. Set it to 11 and it interrupts every 6.3 ms. It’s weird, I know, I left it working at 350us promising myself to have more control over it in the future. The time may have arrived :yum:

I’m using micros() to measure it, I’m sure there’s a better way, but what I get is pretty consistent, so I’m guessing it is accurate enough.

To measure this I used the following code:

void TC5_Handler (void) {
  SerialUSB.println(micros()-antMicrosLoop);
  // Call MIDI read function and if message arrived, the callbacks get called
  MIDI.read();

  MIDIHW.read();
  
  // END OF YOUR CODE
  TC5->COUNT16.INTFLAG.bit.MC0 = 1; //Writing a 1 to INTFLAG.bit.MC0 clears the interrupt so that it will run again
  antMicrosLoop = micros();
}

The complete timer ISR, when no messages arrive, takes 9 us. When a note arrives, the ISR takes 130-180 us to complete for each message.

The code to measure this is similar:

void TC5_Handler (void) {
  antMicrosLoop = micros();
  // Call MIDI read function and if message arrived, the callbacks get called
  MIDI.read();

  MIDIHW.read();
  
  // END OF YOUR CODE
  TC5->COUNT16.INTFLAG.bit.MC0 = 1; //Writing a 1 to INTFLAG.bit.MC0 clears the interrupt so that it will run again
  SerialUSB.println(micros()-antMicrosLoop);
}

We placed the MIDI reads in the ISR because it works A LOT better when there are dumps of incoming messages, i.e.: when ableton has a track select event with a script dumping all the track data.
The loop takes from 500 to 2500 us to complete, depending on how many components it has to poll, so if the MIDI reads are there, it can fail to receive all the messages.

I think the stuck LEDs problem is on the SAMD21 <-> SAMD11 communication, and not on the MIDI reception.
We tested the midi reception sending with RtMIDI a specific amount of messages/second and counting them on the controller. We got to the point of receiving all of them with no drops.

We’ll work on it until we get it right!

@francoytx: nobody should apologize to me for making long posts :laughing:

(BTW, please never feel any pressure to respond in detail, or at all, on these theories–I don’t want to overwhelm you with message processing and stop your loop from doing actual work! :slight_smile: I know you’re putting a TON of effort into debugging this ASAP)

Anyway, thanks for the great information! That’s very good to know, esp. the frequency difference on the TC5 (I made assumptions rather than measuring…)

I’ll look at the M4L patches and think about reproducibility. (Not sure whether 12ms improved things or not, BTW, I haven’t sat down and counted pixels yet).

Totally agree that MIDI reception doesn’t seem to be the problem–as I said to @mike elsewhere, I can flash ALL 96 buttons at a very high rate and it looks fine (as long as I’m not turning an encoder). My suspicion is that there are weird timing interactions between the two controllers (probably explaining both issues).

As to the stuck LED one, my hypothesis is that when a “heavy” ISR happens (meaning it’s processing several messages at once and getting into the ms range) at the wrong time, that’s when things get stuck. If that’s what’s happening, we would look first at timing-dependent behaviors related to SAMD21/SAMD11 communication.

Two things jump out at me (no time to play with them now):

  • the 1ms blocking while loop in feedback.SendFeedbackData:892, waiting for ACK (0xAA), could take unpredictably long if the interrupt hits in the middle of it;
  • the SAMD11 is sending SHOW_END approximately every LED_SHOW_TICKS (subject to its own interrupt), because that gets fired whenever timeToShow is true since it’s outside the conditional, even though SHOW_IN_PROGRESS is only sent when ledShow && (!receivingBank || !receivingLEDdata).

Could SHOW_END be stepping in front of the ACK so that the wait logic never sees it and has to retry too many times? Could it occasionally be filling the incoming serial buffer on the SAMD21 and blocking the ACK from getting saved (I think if the ring buffer fills then writes are just dropped)?

Also, I’m sure you know this, but isn’t micros() limited (see discussion here) to a fairly small range during interrupts, after which it returns a value mod that range? So if the midi ISR were consistently taking (say) ~2130 to 2180us, you might be seeing that as 130 to 180us in your measurement code (precise/consistent but not accurate). No idea whatsoever if this is actually happening, of course!

2 Likes

Quick experimental results:

  • Moving SHOW_END into the condition doesn’t seem to change anything (and may create other problems).
  • However, putting MIDI.read() in the loop (and commenting it out in the ISR) seems, on a first run, to significantly improve (but NOT eliminate) the stuck pixels. I had to wait about 2 minutes to see the first one; usually I see one within 20-30 seconds.

I’m going to block out 20 minutes and actually do repeated tests later. Just wanted to give you initial impressions.

And a last note for now: I got .println() debugging working in Arduino IDE (level up!) and can more or less confirm that MIDI notes are reaching the callback functions in the correct order and groups even when stuck pixels happen.

1 Like

well I think you actually read the incorrect comments on the code, so your assumptions were based on something present in the code haha

I saw this and already fixed it. It now sends “SHOW_END” only if show() actually begins.

I know millis() inside an ISR is not available, but I think micros() is.
The discussion you link to is about AVR which have 8/16 bit registers and might overflow. I think this is less of a problem in SAMD arch and micros() is quite reliable. A different way to test this can be with an oscilloscope setting a pin HIGH entering the ISR and LOW when leaving. Can be very useful with stable ISR durations, but might be hard to track when a big difference in timing occurs. Anyhow, like we’ve been saying, the problem seems to be somewhere else.

Great! Yeah, like i said before, MIDI reception is pretty solid. I’d aim at the buffering, and the SAMD21 <-> SAMD11 communication.

This affects the MIDI reception, if you try to populate the 96 buttons like you said in mike’s post, you’ll probably get a poorer performance with the MIDI reads in the loop.

Thanks for these reports!

1 Like

I did a 15 minute watch test sending notes to light up 4 LEDs at a time every 500 ms and I got 9 pixels that didn’t turn ON and 2 pixels that didn’t turn OFF.

In my case I am almost certain it was always the first of the 4 LEDs that didn’t turn on. The ones that didn’t turn OFF were not the same position
There’s a starting point.

1 Like

Great! Thanks. Glad you can reproduce. If you up the pixel/push count you can probably get it to stick reliably within the first 30s (less boring for debugging). If we don’t get this figured out soon I’ll start some data gathering about what affects rates, maybe with a webcam and some simple machine vision to do the counting. But I think if the right diagnostic output is on when a pixel sticks, cause will be pretty easy to figure out.

Will pick up other points later, & thanks for the clarification on micros(); if it’s not obvious, I know VERY little about actual hardware :slight_smile: and am just analogizing from distributed systems programming

I have log output which seems to differentiate failed updates from actual updates. (This is just a tiny snippet of the log but it’s the part that seems relevant).

This one (a NOTE_ON) successfully updated:

32842882 Calling FillFrameWithDigitalData 232
32843078 sendSerialBufferDec set 4 71 0 1 0 230 8 14 255
32844137 Calling SendFeedbackData from SendDataIfReady
32844411 ...waiting for ACK on try 0
32844583 ...received 170 //ACK

The very next call sequence (processing a NOTE_OFF) failed to update, resulting in a stuck pixel:

32844829 Calling FillFrameWithDigitalData 233
32845037 sendSerialBufferDec set 4 70 0 0 0 0 0 0 255
32846076 Calling SendFeedbackData from SendDataIfReady
32846344 ...waiting for ACK on try 0
32846485 ...received 249 // SHOW_IN_PROGRESS
32846760 ...waiting for ACK on try 1
32848008 ...waiting for ACK on try 2
32849229 ...waiting for ACK on try 3
32850495 ...waiting for ACK on try 4
32851733 ...waiting for ACK on try 5
32852957 ...waiting for ACK on try 6
32853417 ...received 170 // ACK

I’ve observed two stuck pixels so far, with exactly the same structure (try 0 gets 0xF9, timeouts until try 6 which gets 0xAA).

Other strangeness is present in the log but I won’t be able to look at it until tonight.

EDIT: Yep, this is quite consistently associated with stuck pixels. All the extra .print() and .println() calls seem to massively increase the incidence (which is independently interesting).

I just got eight stuck pixels (7 of them in the last position, one in the first) in one 16-step march with a 500ms delay. Every one of them got SHOW_IN_PROGRESS (249, 0xF9) back from try 0 instead of ACK. They took a variable number of retries to get 170: 2 more (n=1), 5 more (n=1), 6 more (n=3), 7 more (n=2), and 10 more (n=1, with 248 (CHECKSUM_ERROR) on retry #8 and 250 (SHOW_END) on retry #9). So 5-7 retries before ACK is typical but the distribution extends in both directions.

Additionally, one other update got 250 (SHOW_END) back from try 0, but 249 on try 1, and that one seems to have updated (not 100% sure).

Hi @gc3!
Yes! I did a similar test.

It’d seem that from time to time the SAMD21 fails to realize that the SAMD11 is showing the neo pixels… so it tries a couple of times to send the same message

Until it gets an ACK and then follows on.

I sitll can’t seem to relate this to a stuck pixel every time.

I improved a bunch of stuff that was a bit off in the SAMD11 code, I will update the repo in this branch tomorrow.

Seems to be an improvement… I got 5-6 minutes without mistakes a couple of times.

There’s something I tested, which is making the SAMD21 signal the SAMD11 it can start to show pixels, but for now haven’t got it to work better than the current strategy, which is show regularly if there are changes to be made.

Still work to do! But I’m confident we’re narrowing it down!

1 Like

Hi @gc3!

Let’s try with this set:
ytx-v2-firmware-aux-0-14-testing.bin (11.1 KB)
ytx-v2-firmware-main-0-14-testing.bin (74.5 KB)

I came up with something of a way to check how good the MAIN <-> AUX comms are working and it is by just counting how many failed events happen when MAIN tries to send a message.
It is printing this every time it happens.
It will print total accumulated number of fails, then if it failed because of timeout you will see the timeout microseconds, and then “3,X” or “4,X” where 3 or 4 are encoder switch or digital and X is the index that failed, and then you will see the read index and write index of the feedback buffer.
These prints are on lines 890-903 of feedback.ino in branch bugFixFeedbackUpdate.

I tried 2 times for 5 minutes sending 4 notes ON and 4 notes OFF every 100 milliseconds and got no failed events.

I can’t see any more stuck or no-show pixels.

Please test them when you can :slight_smile:
If you still see stuck pixels, we can try lowering the Serial baud of the main <-> aux link.

1 Like

Heyo! Awesome! Thank you so much for pouring your time into this. It speaks REALLY well of you and your company. You’ve gotten at least one lifetime customer out of this beta :slight_smile:

Initial test was encouraging (no stuck pixels after 2min, new record, with 6 every 125ms). Will test more thoroughly.

Not at all to diminish your apparent heroic victory over stuck pixels but I can’t tell if the other problem (timing breakdown on encoder turn) is better or not–probably a little, but it’s still there. I’ve been pretty sure from my code review and tests that they’re not that related. Will report back on that too–trying some things to smoke it out.

1 Like

One problem at a time!

I think the stuck pixel problem is coming to an apparent solution, I’ll start working on the encoder performance issue soon :wink:

2 Likes

One problem at a time–yes, absolutely!!! :slight_smile: I was just hoping that maybe they were the same problem after all… But, for what it’s worth, the lag on encoder turn does seem subjectively better than v0.13 (it’s hard to measure).

I have done a few more tests and seen nothing sticking. I’ve been distracted from tests, though, because I’m now up and coding on my first application and quite confident that it will work.

You’re amazing!

Will report back when I do some more tests.

2 Likes