I don't mind feedback anywhere. Thanks for taking the time to provide it :)
The problem with the gesture is odd. I've just checked my local code and the code in the release files, and they all use the gesture Alt+NVDA+C. The repo itself has only one commit/version, which also uses Alt+NVDA+C. Are you perhaps using the other add-on I am developing, the Object detection add-on, which uses the Alt+NVDA+D gesture? And yes, these gestures aren't final. For the release versions of these add-ons, I will ensure that there is no gesture conflict.
You can use Alt+NVDA+C+C... (pressing the C key two or more times) to perform captioning on any element. Websites sometimes wrap images in anchor tags, so they are not identified as images. In such cases, Alt+NVDA+C+C... is faster than navigating to the child image element. Looking at the link you provided, I see that most of the images are "graphic vectors". The model used in this add-on only works with images of "natural settings", i.e. images of people, animals and objects, not images made with graphics programs. You can test the add-on on Google image results for keywords such as people, giraffes, cycling, etc.
It is not possible to enlarge the image with the add-on itself, but that might be a good feature to add. It is, however, possible to resize images in photo-viewing applications such as the default Microsoft Photos application, or even to vary the window size. It would be helpful if you could provide the log messages when recognition fails.