Image Captioning Add-on!

Shubham Jain
 

Hello Everyone!

I am very excited to announce a pre-release version of my Image Captioning add-on! You can download it here: https://github.com/ShubhamJain7/imageCaptioning-NVDA-Addon/releases/tag/v0.1-alpha
This add-on lets users run image captioning on image elements on their screen and get a caption that describes the image in English. The result is announced to the user and also presented in a virtual window, where users can read it character by character, word by word, or as a whole, and even copy it.
Captioning can be triggered by pressing Alt+NVDA+C once, or by pressing the C key two or more times (Alt+NVDA+C+C+...).
The former only performs captioning if the current navigator object has the role ROLE_GRAPHIC. This prevents non-visual users from waiting for bad results after mistakenly starting captioning on a non-image element. Low-vision users, or anyone else, can press Alt+NVDA+C+C+... to caption any element, with no filtering by role.
The result is announced as soon as it is available. The announcement is followed by RESULT_DOCUMENT, which indicates that focus has shifted to a "virtual result window". Users can then use the arrow keys to read the result character by character, word by word, or as a whole. Pressing ESC or moving focus to another element on the screen closes the "virtual result window".
The result can be accessed again by pressing Alt+NVDA+R.
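
The two gestures described above amount to a small gating rule. Here is a minimal sketch in plain Python, not the add-on's actual code: `should_caption` and the string role name are hypothetical stand-ins, and the repeat count is assumed to be 0 for a single press and 1 or more for repeated presses (inside NVDA, something like `scriptHandler.getLastScriptRepeatCount()` would typically supply it).

```python
ROLE_GRAPHIC = "graphic"  # hypothetical stand-in for NVDA's ROLE_GRAPHIC constant


def should_caption(role: str, repeat_count: int) -> bool:
    """Decide whether to start captioning for the current element.

    repeat_count is assumed to be 0 for a single Alt+NVDA+C press and
    1 or more when the C key is pressed repeatedly (Alt+NVDA+C+C+...).
    """
    if repeat_count >= 1:
        # Repeated press: caption any element, with no role filtering.
        return True
    # Single press: only caption elements that report the graphic role.
    return role == ROLE_GRAPHIC
```

With this rule, a single press on a link is rejected, while a repeated press on the same link goes ahead.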

As with most open-source image captioning models, the results can sometimes be wrong. The model can also produce different results for the same image at different sizes. For images in which objects cannot be easily identified, the model can take quite some time to produce a result. It may also be slow the first time it is triggered.

I would be very grateful if you could test my add-on and share your feedback with me. If you have any issues with the add-on or would like to request any changes, feel free to reach out to me or create an issue at the GitHub repository: https://github.com/ShubhamJain7/imageCaptioning-NVDA-Addon/

Noelia Ruiz
 

Hi, as always, many thanks for this interesting project. I may create
issues on GitHub at a more advanced stage of the add-on, once easy
problems like conflicts with known commands are fixed. For now I prefer
to provide feedback here, since issues may be more useful for reporting
problems which require more investigation, unless you prefer a
different approach to receiving feedback:

- The link provided gave me a version where NVDA+alt+d is used,
not NVDA+alt+c. Maybe it is a previous version or something; I'm not
sure.
- I suggest you don't use NVDA+alt+c, since this is used for
reporting comments in Excel and Word (see the NVDA commands quick
reference or the user guide).
- I have cloned your repo and built the add-on myself, and now
NVDA+alt+c is working as described. Anyway, so far I haven't been able
to get any image recognized.
- Sometimes, when pressing g in browse mode and then NVDA+alt+c, the
add-on announces that this is not an image, but after moving the
object navigator inside (to the first child), NVDA detects the
graphic as such, though recognition fails.
- How can images be enlarged, if possible, using the add-on? Sometimes
it announces that the image is too small.

I have tried, among other places, at
https://www.freepik.com/free-photos-vectors/graphics

Kind regards

(In reply to Shubham Jain's message of 2020-08-01 18:39 GMT+02:00.)

Shubham Jain
 

Hello Noelia,
I don't mind feedback anywhere. Thanks for taking the time to provide it :)

The problem with the gesture is odd. I've just checked my local code and the code in the release files, and they all use the gesture Alt+NVDA+C. The repo itself has only one commit/version, which also uses Alt+NVDA+C. Are you perhaps using the other add-on I am developing, the Object Detection add-on, which uses the Alt+NVDA+D gesture? And yes, these gestures aren't final. For release versions of these add-ons, I will ensure that there are no gesture conflicts.

You can use Alt+NVDA+C+C... (pressing the C key two or more times) to perform captioning on any element. Websites sometimes wrap images in anchor tags, so they are not identified as images. In such cases, Alt+NVDA+C+C... is faster than navigating to the child image element. Looking at the link you provided, I see that most of the images are vector graphics. The model used in this add-on only works with images of "natural settings", i.e. images of people, animals and objects, not images made with graphics programs. You can test the add-on on Google Images results for keywords such as people, giraffes, cycling, etc.
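
To illustrate the wrapping problem: when an image sits inside a link, the element reached first reports a link role and the graphic is one of its descendants. The sketch below is not how the add-on works (it relies on the repeated key press instead); it only shows, with a hypothetical `Node` type and plain string role names, what "navigating to the child image element" means as a depth-first search.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for an element in the accessibility tree."""
    role: str
    children: List["Node"] = field(default_factory=list)


def find_graphic(node: Node) -> Optional[Node]:
    """Return node itself, or its first descendant, whose role is 'graphic'."""
    if node.role == "graphic":
        return node
    for child in node.children:
        found = find_graphic(child)
        if found is not None:
            return found
    # No graphic anywhere under this node.
    return None
```

For a link that wraps a single image, `find_graphic` returns the inner image node; for a plain text link it returns `None`.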

It is not possible to enlarge the image with the add-on itself but that might be a good feature to add. It is, however, possible to resize images in photo-viewing applications such as the default Microsoft Photos application or even vary the window size. It would be helpful if you could provide the log messages when recognition fails.
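
If an enlarge feature were added, one piece of it would be working out the target size. Here is a minimal sketch under stated assumptions: `scaled_size` and the `min_side` threshold are hypothetical, since the minimum size the add-on actually requires is not given in this thread. It scales the image up so that its shorter side reaches `min_side` while keeping the aspect ratio.

```python
from typing import Tuple


def scaled_size(width: int, height: int, min_side: int) -> Tuple[int, int]:
    """Return (width, height) enlarged so the shorter side is at least min_side.

    The aspect ratio is preserved (rounding up to whole pixels), and images
    that are already large enough are returned unchanged.
    """
    if width <= 0 or height <= 0 or min_side <= 0:
        raise ValueError("width, height and min_side must be positive")
    shorter = min(width, height)
    if shorter >= min_side:
        return width, height
    # Integer ceiling division avoids floating-point rounding surprises.
    new_width = -(-width * min_side // shorter)
    new_height = -(-height * min_side // shorter)
    return new_width, new_height
```

Upscaling adds no detail, and since the model can give different captions for the same image at different sizes, an enlarged image may still produce a different or poor result.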

Thanks!

ChrisLM
 

Yes, I agree, very interesting project.

If I have not misunderstood, when you move to a graphic in browse mode with the G key and the object is not recognized as an image, you should press NVDA+Alt+C twice instead of moving the object navigator inside.

I love the function "isScreenCurtainEnabled()". A similar function might be useful for the Windows OCR used in NVDA.


Thanks!

Chris.

(In reply to Noelia Ruiz's message of 02/08/2020, 06:30.)

Noelia Ruiz
 

Thanks Chris, I have tested both with and without moving the object navigator inside the graphic, just to see whether there was any difference.
See you soon.
Ciao

(In reply to ChrisLM's message of 2 Aug 2020, 10:39.)

Noelia Ruiz
 

Thanks Shubham, definitely, I was using your other add-on :)
I will test with natural images and will provide more feedback.
Cheers

(In reply to Shubham Jain's message of 2 Aug 2020, 10:09.)

Noelia Ruiz
 

Hello Shubham and all:

I have tried on this webpage without success:

https://www.medicalnewstoday.com/articles/322868#How-dogs-keep-you-in-good-health

I searched this via Google Images. Since I don't use this Google
service, it's possible that I have made some mistake:
1. I clicked on Google Images.
2. In the edit combo box, I have searched for results related to dogs.

Also, I opened Thorium Reader and tried to detect the cover of a
book and other graphics, without success. You can search for messages
containing "fail", "too small" and "is not an image" in this NVDA
debug log. Hope this helps.
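
The log search suggested above can be sketched in a few lines of Python. This is only an illustration: `matching_lines` is a hypothetical helper, and the keywords are the three phrases quoted in this message, matched case-insensitively.

```python
from typing import Iterable, List, Sequence

# The three phrases mentioned in the message above.
KEYWORDS = ("fail", "too small", "is not an image")


def matching_lines(lines: Iterable[str], keywords: Sequence[str] = KEYWORDS) -> List[str]:
    """Return the lines that contain any keyword, ignoring case."""
    lowered = [keyword.lower() for keyword in keywords]
    return [
        line.rstrip("\n")
        for line in lines
        if any(keyword in line.lower() for keyword in lowered)
    ]
```

It can be fed an open file object for a saved copy of the log, or any other iterable of lines.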

Shubham Jain
 

Hello Noelia,

Thanks for taking the time to test my work and for providing all the details.
I have tried to replicate the issues you faced, without success. I am able to get results on the website you shared and on Google Images search results too.
I will take a closer look at this problem and get to the root of it. Hopefully, all these issues will be fixed in the next release. Stay tuned!

regards,
Shubham Jain