The debate over AI rages on, and I find myself caring less and less as the tug of war grows fiercer: one side says AI is a threat to humanity, the other says AI can do lots of amazing stuff and definitely couldn't take our jobs. No, AI cannot take our jobs. Rich people can take our jobs and give them to AI, though.
This post isn’t going to be about the rich people that bend AI, and anything else they can, to their will. This post is about why I use large language models, especially multimodal ones, and why I find them so useful. A lot of people without disabilities, particularly those who aren’t blind, probably won’t understand this. That’s okay. I’m writing this for myself, and for those who haven’t gotten to use this kind of technology yet.
Text-Only Models
ChatGPT was the first large language model I used. It introduced me to the idea, and to its limitations. It couldn't give an accurate list of screen reader commands. But it could tell me a nice story about a kitten who drinks out of the sink. From the start, I wondered if I could feed the model images. I tried with ASCII art, but it wasn't very good at describing that. I tried with Braille art, but it wasn't good at that either. I even tried with an SVG, but the whole thing wouldn't fit into the chat box.
I was disappointed, but I kept trying different tasks. It was able to explain the output of some Linux commands, like top, which doesn't read well with a screen reader. It was even able to generate a little Python script that turned a CSV file into an HTML table.
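A script for that job looks roughly like the sketch below (a minimal example, not the exact code it generated; the file names are just placeholders):

# Minimal sketch of a CSV-to-HTML-table script.
# "data.csv" and "table.html" are placeholder file names.
import csv
import html

with open("data.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))

lines = ["<table>"]
if rows:
    # Treat the first row as the header row.
    header = "".join(f"<th>{html.escape(cell)}</th>" for cell in rows[0])
    lines.append(f"  <tr>{header}</tr>")
    for row in rows[1:]:
        cells = "".join(f"<td>{html.escape(cell)}</td>" for cell in row)
        lines.append(f"  <tr>{cells}</tr>")
lines.append("</table>")

with open("table.html", "w", encoding="utf-8") as f:
    f.write("\n".join(lines))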
As ChatGPT improved, I found more uses for it. I could ask it to generate a description of a video game character, or describe scenes from games or TV shows. But I still wanted it to describe images.
My Fascination with Images
I've always wanted to know what things look like. I've been blind since birth, so I've never seen anything. From video games to people to my surroundings, I've always wondered what things look like. I guess that's a little strange in the blind community, but I've always been a little strange in any community. So many blind people don't care what their computer interface looks like, or what animations are like, or even if there is formatting information in a document or book. I do. I love learning about what apps look like, or what a website looks like. I love reading with formatted Braille or speech, and learning about different animations used in operating systems and apps. I find plain screen reader speech, without sounds and such, to be boring.
So, when I heard about the Be My Eyes Virtual Volunteer program, I was excited. I could finally learn what things look like. I could finally learn what apps and operating systems look like. I could send it pictures of my surroundings, and get detailed descriptions of them. I could send it pictures of my computer screen, and understand what’s there and how it’s laid out. I could even send it pictures from Facebook or Twitter, and get more than a bland description of the most important parts of the image.
I began trying the app with saved pictures and screenshots. The AI, GPT-4's multimodal model, gave excellent descriptions. I finally learned what my old cat looks like. I learned what app interfaces like Discord look like. I sent it screenshots of video games from Dropbox, and learned what some video game characters and locations look like.
Now, it's not always perfect. Sometimes it imagines details that aren't there. Sometimes it doesn't get the text in an image right. If a large language model is a blurry picture of the web, I'd rather have that than a blank canvas. I'd rather see a little than not at all. And that's what these models give me. No, it's not real sight. I wouldn't want to wait a good 30 seconds to get a description of each frame of my life. But it's something. And it's something that I've never had before.
Feeding the Beast
A lot of people will say that these models just harvest our data. They do. A lot of people will then say that I shouldn't be feeding their Twitter posts, video games, interfaces, comic books, and book covers into the models. My only response is that if all these things were accessible to me, I wouldn't have to feed them to the models. So if you don't want your pictures in OpenAI's next batch of training data, add descriptions to them. If you don't want your video game pictures used in the next GPT model, make your game accessible. If you don't want your book covers used in the next GPT model, add a description to them. That's all there is to it. I'm not giving up this new ability to understand visual stuff.