Simon Willison
Podcast Appearances
There is at least one significant model now where the training data is at least open, as in you can download a copy of the training data. It includes stuff from the Common Crawl, so it includes a bunch of copyrighted websites that they've scraped. But there is at least one model now that has completely transparent licensing, transparency on the training data itself, which is good, you know.
One of the other things that I've been tracking is, I love this idea of a vegan model, an LLM which really was trained entirely on openly licensed material, such that all of the holdouts on ethical grounds over the training... which is a position I fully respect. If you're going to look at these things and say, I'm not using them, I don't agree with the ethics of how they were trained,
That's a perfectly rational decision for you to make. I want those people to be able to use this technology. So actually, one of my potential guesses for the next year was I think we will get to see a vegan model released. Somebody will put out an openly licensed model that was trained entirely on licensed or public domain work. I think when that happens, it will be a complete flop.
I think what will happen is it won't be as good as the... It'll be notably not as useful. But more importantly, I think a lot of the holdouts will reject it because we've already seen this. People saying, no, it's got GPL code in it. The GPL says that you have to attribute the... There's attribution requirements not being met, which is entirely true. That is, again, a rational position to take.
But I think that... It's both true and it makes sense to me, but it's also a case of moving the goalposts. So I think what would happen with a vegan model is the people who it was aimed at will find reasons not to use it. And I'm not going to say those are bad reasons, but I think that will happen.
In the meantime, it's just not going to be very good because it won't know anything about modern culture or anything where it would have had to rip off a newspaper article to learn about something that happened.
I'm very sold on that with one sort of edge case. And that's the thing about writing. The most tedious part of learning is learning to write essays. That's the thing that people cheat on. And that's the thing where I don't see how you learn those writing skills without the miserable slog, without the tedium.
And so that's the one part of education I'm most nervous about is how do people learn the tedious slog of writing when they've got this tempting devil on their shoulder that will just write it for them.
I will say one thing about LLMs for feedback. They can't do spell checking. I only noticed this recently. Claude, amazing model, it can't spot spelling mistakes. If I ask it for spell checking, it hallucinates words that I didn't misspell, and it misses the words that I did. And it's because of the tokenization, presumably. But that was a bit of a surprise. It's like, it's a language model.
You would have thought that spelling, spell checking would work. Anything they output is spelled correctly, but they actually have difficulty spotting spelling mistakes, which I thought was interesting.