[FIXED] Apparent overloading when moving RGB rotary encoder

francoytx · September 14, 2020, 2:37pm

We’re very glad that you are happy with your controller and that you already started looking inside the firmware and understanding it!

I am sorry it’s giving you some trouble with the feedback but I’m sure we can find out the reason.

24 messages per second don’t seem a lot, and we even tested the controllers with remote scripts and RtMIDI sending a lot more, and we found no apparent issues, but we want to test something close to your setup to see what’s going on.

Can you please share a video or a GIF of the issue? And if possible, the code you are using for RTMIDI?

You are right about the SAMD11 driving the feedback for performance reasons!
Too many LEDs require too much time for the processor and if all was done in a single microcontroller, the performance would be really really bad.

And for the single NeoPixel in the main SAMD21, we use the Adafruit library which, as you correctly say, does disable interrupts, but as the Status LED is a single LED, the interrupts are disabled for a very short period of time.

At 800 kbits/s (the neopixel data rate) a single bit requires 1.25 us to be sent, and a single LED requires 24 bits, so the complete update is done in 30 us.

I doubt that this alone can harm the feedback performance, but, I think it’s worth giving a try with your setup to comment either the UpdateStatusLED() in the loop or the library’s show() funcion inside.

This LED has given me some other rather obscure issues, and I tried other libraries and the problems persisted, so for now I stayed with the Adafruit library, but if we prove that there are performance issues or any other problems, and that a different library can help solve them, we got no problem to change the library.

Let me know if you do further testing!

gc3 · September 14, 2020, 4:01pm

@francoytx, thanks so much for your quick reply! It’s very clarifying, helpful, and encouraging.

The firmware recompilation went very smoothly; you’ve set it up extremely well. Thanks for that too!

Unfortunately (or maybe fortunately?) you’re right: turning off the StatusLED update entirely doesn’t seem to change the situation at all (turns out that 30us is not a long time!)

I’m trying a few more things now (since I’m in-studio for a bit and can actually experiment, rather than just theorizing!) and will keep you posted. I’ll also send you some video when I get a chance.

gc3 · September 14, 2020, 5:43pm

OK, interesting. I turned off a bunch of stuff in the loop, so now it’s literally just:

void loop() {
    antMicrosLoop = micros();

    CheckSerialSAMD11();

    if(enableProcessing){
        digitalHw.Read();
        feedbackHw.Update();  
    }
   // power change and speed print stuff
}

Obviously, with the encoder reading turned off, I can’t observe the encoders stepping on the updates. But I do see another problem (which I was also seeing before), which is LEDs occasionally getting stuck when I have the six-button column marching across the screen. It looks like they’re getting the V127 note but not the V0 note. It can take a few sweeps to see, but I always notice it after enough time. It doesn’t seem to be linked to activity on the host computer.

My instinct is that the problems may be related: if 24 notes/second causes occasional dropouts with the encoders turned off, I’m not surprised that doing something relatively heavy in the loop makes it worse. So I’m going to work on the stuck LED case (since it’s simpler) and see if that helps. Video to follow.

Here’s the test code I’m using, hacky and basic. Set MARCH=true to do the sweep (for identifying stuck pixels) and MARCH=false for a blink in place (easier to see the encoder timing interference, although it can be seen in the sweep as well).

#include <iostream>
#include <cstdlib>
#include <chrono>
#include <thread>
#include "RtMidi.h"

int main()
{
    RtMidiOut* midiout = 0;

    // right from the docs
    try {
        midiout = new RtMidiOut();
    }
    catch (RtMidiError& error) {
        error.printMessage();
        exit(EXIT_FAILURE);
    }
    
    // Print outputs (to faciliate selection of CHOSEN_OUT_PORT).
    // right from the docs except less DRY but no goto and port #s aren't bumped +1
    unsigned int nPorts = midiout->getPortCount();
    std::cout << "\nThere are " << nPorts << " MIDI output ports available.\n";
    std::string portName;
    for (unsigned int i = 0; i < nPorts; i++) {
        try {
            portName = midiout->getPortName(i);
        }
        catch (RtMidiError& error) {
            error.printMessage();
            delete midiout;
            return 0;
        }
        std::cout << "  Output Port #" << i << ": " << portName << '\n';
    }
    std::cout << '\n';

    unsigned int CHOSEN_OUT_PORT = 1;
    midiout->openPort(CHOSEN_OUT_PORT);

    // YAELTEX V2 digital button config
    // Code assumes the target buttons start w/Note 0 in the upper left
    // and are mapped 0..NCOLS-1, NCOLS..2*NCOLS-1, etc.
    unsigned int NCOLS = 16;
    unsigned int NROWS = 6;

    // Output routine config
    unsigned int RUNS_TO_DO = 1000;
    unsigned int BASE_DELAY_MS = 125; // doubled for MARCH to keep the per-second throughput the same
    bool MARCH = false; // are we doing a sweep, or a blink in place?
    unsigned int NONMARCH_COL = 8; // column to blink in place if MARCH == false

    // Output routine begins
    std::vector<unsigned char> message = { 0, 0, 0 };

    for (int run = 0; run < RUNS_TO_DO; ++run)
    {
        for (unsigned char on_col = 0; on_col < NCOLS; ++on_col)
        {
            unsigned char off_col = (on_col == 0) ? (NCOLS - 1) : (on_col - 1);
            for (unsigned char row = 0; row < NROWS; ++row)
            {
                if (MARCH || (on_col % 2 == 0))
                {
                    message.assign({ 144, (unsigned char)(row * NCOLS + ((MARCH) ? on_col : NONMARCH_COL)), 127 });
                    midiout->sendMessage(&message);
                }
                if (MARCH || (on_col % 2 == 1))
                {
                    message.assign({ 144, (unsigned char)(row * NCOLS + ((MARCH) ? off_col : NONMARCH_COL)), 0 });
                    midiout->sendMessage(&message);
                }
            }
            std::this_thread::sleep_for(std::chrono::milliseconds(BASE_DELAY_MS * ((MARCH) ? 2 : 1)));
        }
    }

    delete midiout;
    return 0;
}

francoytx · September 14, 2020, 6:47pm

Load this to the AUX microcontroller and try again.

There is a timer in the AUX controller that updates every LED_SHOW_TICKS milliseconds whichever strip had changes (only 1 at a time, this may be problematic, now that I think about it)

I changed LED_SHOW_TICKS to 12 ms for you, but this is normally on 20 ms (configurations with all the encoders and all the buttons available take longer than smaller configurations)

Let’s see how it works
seq616-aux-firmware.bin (11.0 KB)

gc3 · September 14, 2020, 6:52pm

ah, many thanks! I haven’t figured out how to compile the aux firmware yet so this is very helpful.

I’ll reset to factory 0.13 main and then load this.

francoytx · September 14, 2020, 6:57pm

I didn’t post about it, but if you get Atmel Studio, just loading the SAMD11 project should work straightforward.

Compile for ATSAMD11D14AM. It’s included in the project settings though.

You hit “Build” and the binary file shows up in the ./Debug folder

gc3 · September 14, 2020, 7:09pm

I went ahead and grabbed it, and it did pick up the project settings, but I got some initial errors and didn’t dig further (figured I’d learn one firmware at a time ). It’s probably just library linking stuff.

Trying new firmware now!

…Interesting. The encoder timing problems feel just the same as with factory firmware (I don’t have an objective measure of them). They’re definitely still there.

However, good news: I did get a few minutes without a stuck pixel in the MARCH test. Could be luck (it’s a stochastic problem) but I don’t think I’ve run the test that long without seeing one before. That test I can measure objectively (# of stuck pixels observed for N transits) and I’ll do that, for a low N, in a bit.

Do you remember if the strips are arranged in rows or columns on my build? If I understand the one-at-a-time limit, that might explain it, as that problem shows up when I’m blanking one column and setting the next in the same set of messages.

I’ll send video now (to your support email?) so that you won’t have to keep debugging based on my vague descriptions!

francoytx · September 14, 2020, 7:06pm

The samd11 doesn’t rely on libraries, just the ASF (Atmel Software Framework) which gets installed along with Atmel Studio. If you try again, post the errors and we can try to figure it out.

gc3 · September 15, 2020, 7:52pm

I’ve had only a moment to look at this today, but here’s what seems like a key observation (regarding the lag problem, not the “stuck pixel” problem, which may or may not be connected).

Basically, it doesn’t look like the encoder ring feedback is what’s fighting the digital updates. I’ll try to keep narrowing in on the problem (assuming it’s single-caused).

This is on my minimal main firmware build, which simplifies the loop (as above) and implements the priority system. I’m using the updated version of your aux firmware from yesterday.

In addition to that, I commented out FillFrameWithEncoderData(fbUpdateQueueIndex); and SendDataIfReady(); under the FB_* cases. So only the digital buttons are being updated. The encoders are still sending commands, though (confirmed in MidiOx); they’re just not getting the ring/button feedback.

Twisting the encoders still interferes with the “march” very consistently. Sometimes it freezes for three or four beats and then jumps ahead. if anything, it seems worse (?!) than when the encoder feedback update was on. As before, it’s nice and regular timing-wise when the encoders aren’t tuning, although the stuck pixels still happen on their own. (I can send video later if this description’s not enough).

gc3 · September 16, 2020, 5:51am

I got the SAMD11 build working after a moment’s look at the error messages.

In the subcompilation, NeoPixels.c wasn’t finding NeoPixels.h, so I changed #include <NeoPixels.h> to #include "NeoPixels.h";
Then, in the main compilation, main.c wasn’t finding NeoPixels.h, so I changed #include <NeoPixels.h> to #include "NeoPixel/NeoPixels.h" (and #include <ytxHeader.h> to #include "ytxHeader.h", pre-emptively). asf.h didn’t require a change.

I don’t know how exactly Atmel Studio tweaks the Visual Studio base but it looks as though src/NeoPixel wasn’t getting added to the project path (which is very weird as I see it in the .cproj file under <armgcc.compiler.directories.IncludePaths>). Anyway, easy fix, and now I can poke at the aux code too. Thanks for your help!

gc3 · September 16, 2020, 7:46am

OK, here’s one theory (at least about the stuck pixels).

timer.TC5_Handler is getting called every 83.33ms (12Hz) as an interrupt service routine (ISR) to handle incoming MIDI and MIDIHW.

I’m a software person, not a hardware person, but my understanding of ISRs is that they’re supposed to do very little (like enqueue a value and then return immediately). If I’m reading the code right, TC5_Handler:

calls MIDI.read (and MIDIHW.read, but I’ll just focus on the USB side here), which:
- Parses the entire serial buffer, yielding up to as many messages as can fit in it;
- Potentially calls a callback for each method (as registered in setup), each of which:
  - Does some parsing of its own, and then calls ProcessMIDI, which:
    - Does some more message preparation, and then calls UpdateMIDIBuffer for each feedback type, which:
      - does a bunch more stuff, including a binary search for affected encoders.

I haven’t benchmarked it, and of course I may be misreading/misunderstanding, but if that’s all happening during the ISR, it seems like a lot! I wonder if the stuck pixels come from an occasional timing clash between that process and either the feedback or new MIDI reads.

I don’t know the reason that you put MIDI reads on an interrupt (and not in the loop), but if it needs to stay there, have you considered using the interrupt to do nothing but read raw MIDI messages into a fixed-length FIFO queue in memory (and then immediately return), and then consuming that queue (parsing the buffer into MIDI messages, dispatching to callbacks, doing your internal parsing and prep, then calling your own UpdateMIDIBuffer to fill the “cooked” buffer) in the main loop?

If you need that process to happen less (or more) frequently than the other items in the loop you could only call it (or call the other functions) mod N while maintaining a loop counter.

Anyway, I may be completely off-base, and I’m not sure how this would be affecting the encoder timing issue, but If I have time tomorrow I might pull MIDI.read() out of the interrupt and into the loop and see if the stuck pixels still appear occasionally (or if something else blows up!)

francoytx · September 16, 2020, 2:40pm

This is a long post, answering bits of many of your previous ones, sorry for this, but I was taking my time to test and write this.

I totally missed the last part of this message, sorry for that.

So with the 12 ms show() time this improved… it’s still interesting because from what I see in the videos the timing is very slow for this to come up as a problem…

I’m testing with Turn which has 96 (32 encoder switches + 64 buttons) digital inputs, sending (with Max4Live) groups of 6 notes every 40 ms. That results in 12 (6 notes ON + 6 notes off) * 1000/40 = 12*25 = 300 messages per second and I don’t see stuck LEDs

In your controller there is only 2 strips being updated, one for the encoders and one for the buttons. The digital strips are horizontal, and in groups of 16.
In this image you can see the LED indexes for your model:
seq616-led-index

I also tried updating 4 buttons every 500 ms and turning the encoders and can’t reproduce the glitch you have.
Again, I am using Max for Live instead of RtMIDI, but this shouldn’t be relevant…
Here is the max patch in case you have a way to test it: SeqTestV2.amxd (69.7 KB)

Not really, I know I left the original comments there, but the 12 in the line
uint32_t sampleRate = 12; //sample rate, determines how often TC5_Handler is called

Is not really 12 hz. Actually the timer interrupts every 350 uS. If you set that to 13, it jumps to 3.6 ms. Set it to 11 and it interrupts every 6.3 ms. It’s weird, I know, I left it working at 350us promising myself to have more control over it in the future. The time may have arrived

I’m using micros() to measure it, I’m sure there’s a better way, but what I get is pretty consistent, so I’m guessing it is accurate enough.

To measure this I used the following code:

void TC5_Handler (void) {
  SerialUSB.println(micros()-antMicrosLoop);
  // Call MIDI read function and if message arrived, the callbacks get called
  MIDI.read();

  MIDIHW.read();
  
  // END OF YOUR CODE
  TC5->COUNT16.INTFLAG.bit.MC0 = 1; //Writing a 1 to INTFLAG.bit.MC0 clears the interrupt so that it will run again
  antMicrosLoop = micros();
}

The complete timer ISR, when no messages arrive, takes 9 us. When a note arrives, the ISR takes 130-180 us to complete for each message.

The code to measure this is similar:

void TC5_Handler (void) {
  antMicrosLoop = micros();
  // Call MIDI read function and if message arrived, the callbacks get called
  MIDI.read();

  MIDIHW.read();
  
  // END OF YOUR CODE
  TC5->COUNT16.INTFLAG.bit.MC0 = 1; //Writing a 1 to INTFLAG.bit.MC0 clears the interrupt so that it will run again
  SerialUSB.println(micros()-antMicrosLoop);
}

We placed the MIDI reads in the ISR because it works A LOT better when there are dumps of incoming messages, i.e.: when ableton has a track select event with a script dumping all the track data.
The loop takes from 500 to 2500 us to complete, depending on how many components it has to poll, so if the MIDI reads are there, it can fail to receive all the messages.

I think the stuck LEDs problem is on the SAMD21 <-> SAMD11 communication, and not on the MIDI reception.
We tested the midi reception sending with RtMIDI a specific amount of messages/second and counting them on the controller. We got to the point of receiving all of them with no drops.

We’ll work on it until we get it right!

gc3 · September 16, 2020, 4:17pm

@francoytx: nobody should apologize to me for making long posts

(BTW, please never feel any pressure to respond in detail, or at all, on these theories–I don’t want to overwhelm you with message processing and stop your loop from doing actual work! I know you’re putting a TON of effort into debugging this ASAP)

Anyway, thanks for the great information! That’s very good to know, esp. the frequency difference on the TC5 (I made assumptions rather than measuring…)

I’ll look at the M4L patches and think about reproducibility. (Not sure whether 12ms improved things or not, BTW, I haven’t sat down and counted pixels yet).

Totally agree that MIDI reception doesn’t seem to be the problem–as I said to @mike elsewhere, I can flash ALL 96 buttons at a very high rate and it looks fine (as long as I’m not turning an encoder). My suspicion is that there are weird timing interactions between the two controllers (probably explaining both issues).

As to the stuck LED one, my hypothesis is that when a “heavy” ISR happens (meaning it’s processing several messages at once and getting into the ms range) at the wrong time, that’s when things get stuck. If that’s what’s happening, we would look first at timing-dependent behaviors related to SAMD21/SAMD11 communication.

Two things jump out at me (no time to play with them now):

the 1ms blocking while loop in feedback.SendFeedbackData:892, waiting for ACK (0xAA), could take unpredictably long if the interrupt hits in the middle of it;
the SAMD11 is sending SHOW_END approximately every LED_SHOW_TICKS (subject to its own interrupt), because that gets fired whenever timeToShow is true since it’s outside the conditional, even though SHOW_IN_PROGRESS is only sent when ledShow && (!receivingBank || !receivingLEDdata).

Could SHOW_END be stepping in front of the ACK so that the wait logic never sees it and has to retry too many times? Could it occasionally be filling the incoming serial buffer on the SAMD21 and blocking the ACK from getting saved (I think if the ring buffer fills then writes are just dropped)?

Also, I’m sure you know this, but isn’t micros() limited (see discussion here) to a fairly small range during interrupts, after which it returns a value mod that range? So if the midi ISR were consistently taking (say) ~2130 to 2180us, you might be seeing that as 130 to 180us in your measurement code (precise/consistent but not accurate). No idea whatsoever if this is actually happening, of course!

gc3 · September 16, 2020, 6:12pm

Quick experimental results:

Moving SHOW_END into the condition doesn’t seem to change anything (and may create other problems).
However, putting MIDI.read() in the loop (and commenting it out in the ISR) seems, on a first run, to significantly improve (but NOT eliminate) the stuck pixels. I had to wait about 2 minutes to see the first one; usually I see one within 20-30 seconds.

I’m going to block out 20 minutes and actually do repeated tests later. Just wanted to give you initial impressions.

gc3 · September 16, 2020, 7:17pm

And a last note for now: I got .println() debugging working in Arduino IDE (level up!) and can more or less confirm that MIDI notes are reaching the callback functions in the correct order and groups even when stuck pixels happen.

francoytx · September 16, 2020, 7:49pm

well I think you actually read the incorrect comments on the code, so your assumptions were based on something present in the code haha

I saw this and already fixed it. It now sends “SHOW_END” only if show() actually begins.

I know millis() inside an ISR is not available, but I think micros() is.
The discussion you link to is about AVR which have 8/16 bit registers and might overflow. I think this is less of a problem in SAMD arch and micros() is quite reliable. A different way to test this can be with an oscilloscope setting a pin HIGH entering the ISR and LOW when leaving. Can be very useful with stable ISR durations, but might be hard to track when a big difference in timing occurs. Anyhow, like we’ve been saying, the problem seems to be somewhere else.

Great! Yeah, like i said before, MIDI reception is pretty solid. I’d aim at the buffering, and the SAMD21 <-> SAMD11 communication.

This affects the MIDI reception, if you try to populate the 96 buttons like you said in mike’s post, you’ll probably get a poorer performance with the MIDI reads in the loop.

Thanks for these reports!

francoytx · September 16, 2020, 9:47pm

I did a 15 minute watch test sending notes to light up 4 LEDs at a time every 500 ms and I got 9 pixels that didn’t turn ON and 2 pixels that didn’t turn OFF.

In my case I am almost certain it was always the first of the 4 LEDs that didn’t turn on. The ones that didn’t turn OFF were not the same position
There’s a starting point.

gc3 · September 18, 2020, 5:41am

Great! Thanks. Glad you can reproduce. If you up the pixel/push count you can probably get it to stick reliably within the first 30s (less boring for debugging). If we don’t get this figured out soon I’ll start some data gathering about what affects rates, maybe with a webcam and some simple machine vision to do the counting. But I think if the right diagnostic output is on when a pixel sticks, cause will be pretty easy to figure out.

Will pick up other points later, & thanks for the clarification on micros(); if it’s not obvious, I know VERY little about actual hardware and am just analogizing from distributed systems programming

gc3 · September 19, 2020, 2:20am

I have log output which seems to differentiate failed updates from actual updates. (This is just a tiny snippet of the log but it’s the part that seems relevant).

This one (a NOTE_ON) successfully updated:

32842882 Calling FillFrameWithDigitalData 232
32843078 sendSerialBufferDec set 4 71 0 1 0 230 8 14 255
32844137 Calling SendFeedbackData from SendDataIfReady
32844411 ...waiting for ACK on try 0
32844583 ...received 170 //ACK

The very next call sequence (processing a NOTE_OFF) failed to update, resulting in a stuck pixel:

32844829 Calling FillFrameWithDigitalData 233
32845037 sendSerialBufferDec set 4 70 0 0 0 0 0 0 255
32846076 Calling SendFeedbackData from SendDataIfReady
32846344 ...waiting for ACK on try 0
32846485 ...received 249 // SHOW_IN_PROGRESS
32846760 ...waiting for ACK on try 1
32848008 ...waiting for ACK on try 2
32849229 ...waiting for ACK on try 3
32850495 ...waiting for ACK on try 4
32851733 ...waiting for ACK on try 5
32852957 ...waiting for ACK on try 6
32853417 ...received 170 // ACK

I’ve observed two stuck pixels so far, with exactly the same structure (try 0 gets 0xF9, timeouts until try 6 which gets 0xAA).

Other strangeness is present in the log but I won’t be able to look at it until tonight.

EDIT: Yep, this is quite consistently associated with stuck pixels. All the extra .print() and .println() calls seem to massively increase the incidence (which is independently interesting).

I just got eight stuck pixels (7 of them in the last position, one in the first) in one 16-step march with a 500ms delay. Every one of them got SHOW_IN_PROGRESS (249, 0xF9) back from try 0 instead of ACK. They took a variable number of retries to get 170: 2 more (n=1), 5 more (n=1), 6 more (n=3), 7 more (n=2), and 10 more (n=1, with 248 (CHECKSUM_ERROR) on retry #8 and 250 (SHOW_END) on retry #9). So 5-7 retries before ACK is typical but the distribution extends in both directions.

Additionally, one other update got 250 (SHOW_END) back from try 0, but 249 on try 1, and that one seems to have updated (not 100% sure).

francoytx · September 19, 2020, 5:18am

Hi @gc3!
Yes! I did a similar test.

It’d seem that from time to time the SAMD21 fails to realize that the SAMD11 is showing the neo pixels… so it tries a couple of times to send the same message

Until it gets an ACK and then follows on.

I sitll can’t seem to relate this to a stuck pixel every time.

I improved a bunch of stuff that was a bit off in the SAMD11 code, I will update the repo in this branch tomorrow.

Seems to be an improvement… I got 5-6 minutes without mistakes a couple of times.

There’s something I tested, which is making the SAMD21 signal the SAMD11 it can start to show pixels, but for now haven’t got it to work better than the current strategy, which is show regularly if there are changes to be made.

Still work to do! But I’m confident we’re narrowing it down!