Edit: I realize it probably should given my understanding of tokenization but if it’s training data couldn’t it easily be replaced with like a regex or something?
It probably could if everyone did it the same way. But I suspect that isn’t what’s happening, so while our brains pattern recognition the message reasonably easily regardless of the substitution, doing that at scale with regex would be a lot more difficult.
To jumble the text for training ai
Huh does that actually work?
Edit: I realize it probably should given my understanding of tokenization but if it’s training data couldn’t it easily be replaced with like a regex or something?
It probably could if everyone did it the same way. But I suspect that isn’t what’s happening, so while our brains pattern recognition the message reasonably easily regardless of the substitution, doing that at scale with regex would be a lot more difficult.