What's an AI? And Will it Steal My Job?

A fancy text-expander and maybe (but probably not). Thanks for watching, please like and subscribe. If we want a more detailed answer, we need to go deeper.

In the media, it seems there are only two options: either AI is self-aware (or almost self-aware) and will take over all jobs within the next 2-6 months, or AI is just text expansion like on our phones and will not amount to anything. The truth is closer to the second, but just because AI is not self-aware does not mean it cannot be useful.

Let's first get an idea of what an LLM, or large language model, really is. LLMs are one of the two huge trends in AI currently (the other being GANs, generative adversarial networks, for generating anime titties from text prompts). An LLM is an extremely fancy text expander. To get a feeling for how a text expander works, let's turn to another example first.

Suppose we want to guess where I am: at home, at the pub, at work, or at the supermarket. One way to go about that is to simply sample my whereabouts and use that for guessing. I'm way into statistics, so I actually have an app on my phone doing just that:

413 + 240 hours at home, 38 hours at the office, and 12 hours at the pub, plus 25 hours for transport and 7 hours at the movies, which we're going to ignore. That means there's a 93% chance I'm at home, a 5% chance I'm at the office, and a 2% chance I'm at the pub, so a predictor that always says I'm at home will be right 93% of the time. That's like predicting the weather in the Netherlands by always saying it will rain: mostly correct, and completely useless.
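The arithmetic is easy to check. A quick sketch, using the hours from the app and ignoring transport and the movies as above:

```python
# Hours per location from the app (transport and the movies ignored,
# as in the text).
hours = {"home": 413 + 240, "office": 38, "pub": 12}
total = sum(hours.values())  # 703 hours

# Turn hours into whole-percent probabilities.
probs = {place: round(100 * h / total) for place, h in hours.items()}

# The baseline predictor always picks the most likely place.
baseline = max(hours, key=hours.get)

print(probs)     # {'home': 93, 'office': 5, 'pub': 2}
print(baseline)  # home
```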

There are many ways to make better predictions; maybe I'm more likely to be in certain places at certain times, but for the sake of the example, we'll make a better predictor by taking into account where I just came from. Instead of having just a list of probabilities, we'd look at the probability that I'm at home given I was just at the office, that I'm at the office given I was just at the supermarket, etc. When I leave home, it's either to go to the pub, to the office, or to the supermarket. I go to the office 1-2 times a week, the supermarket 2-3 times a week, and the pub around once a month. When I leave the office, I either go home or to the pub. I typically go to the pub from work once a month and home otherwise. When leaving either the pub or the supermarket, I almost always go home. We could write that as follows:

| from \ to   | home | office | pub | supermarket |
|-------------|------|--------|-----|-------------|
| home        |      | 35%    | 6%  | 59%         |
| office      | 75%  |        | 25% |             |
| pub         | 100% |        |     |             |
| supermarket | 100% |        |     |             |

Now we can make a better predictor: if I'm at home, we predict I'll go to the supermarket; if I'm at the office, pub, or supermarket, we predict I'll go home. This is known as a Markov model. Markov models are very widely used because the maths turns out to be extremely easy. If you're not into mathematics like some sort of deviant, don't sweat it: just remember that the table above allows us to guess where I am based on where I just was.
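As a sketch, the table really is all a Markov model needs: a lookup from the current location to next-location probabilities, plus picking the most likely option. Something like:

```python
# First-order Markov model: the transition table as a dictionary.
# Missing entries are 0%.
transitions = {
    "home":        {"office": 0.35, "pub": 0.06, "supermarket": 0.59},
    "office":      {"home": 0.75, "pub": 0.25},
    "pub":         {"home": 1.0},
    "supermarket": {"home": 1.0},
}

def predict_next(current):
    """Most likely next location, given only the current one."""
    options = transitions[current]
    return max(options, key=options.get)

print(predict_next("home"))    # supermarket
print(predict_next("office"))  # home
```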

Markov models, as easy as the maths is, are not the best. For example, I typically go to the office or pub on Fridays and the supermarket on Saturdays, but I go home in between. The Markov chain cannot make use of this information, as it only knows where I was last. If I give it a bit more "memory" and allow it to look at the last two places I was, it can capture this. For example, we could represent it as follows:

| previous, current        | home | office | pub | supermarket |
|--------------------------|------|--------|-----|-------------|
| home, home               |      | 35%    | 6%  | 59%         |
| home, office             | 75%  |        | 25% |             |
| home, pub                | 100% |        |     |             |
| home, supermarket        | 100% |        |     |             |
| office, home             |      |        |     | 100%        |
| office, office           | 75%  |        | 25% |             |
| office, pub              | 100% |        |     |             |
| office, supermarket      | 100% |        |     |             |
| pub, home                |      |        |     | 100%        |
| pub, office              | 75%  |        | 25% |             |
| pub, pub                 | 100% |        |     |             |
| pub, supermarket         | 100% |        |     |             |
| supermarket, home        |      | 35%    | 6%  | 59%         |
| supermarket, office      | 75%  |        | 25% |             |
| supermarket, pub         | 100% |        |     |             |
| supermarket, supermarket | 100% |        |     |             |

Some of the rows are a bit doubtful and could likely be improved, but it is the principle that counts here: the predictor now knows more about my habits and can make predictions about my whereabouts around the beginning of the weekend. The predictor is still not great at guessing when I'll go to the pub, though.
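Extending the memory to two locations just changes the lookup key from a single location to a (previous, current) pair. A sketch with a few of the rows; note that placing the 100% entries for the "went home from the office/pub" rows under supermarket is my reading of the weekend routine, not something the table states outright:

```python
# Second-order Markov model: keys are (previous, current) pairs.
# Only a few rows shown; the rest follow the same shape.  Sending
# office-then-home to the supermarket encodes the Friday-to-Saturday
# routine (an assumed reading of the table).
transitions2 = {
    ("home", "home"):   {"office": 0.35, "pub": 0.06, "supermarket": 0.59},
    ("home", "office"): {"home": 0.75, "pub": 0.25},
    ("office", "home"): {"supermarket": 1.0},
    ("pub", "home"):    {"supermarket": 1.0},
}

def predict_next2(previous, current):
    """Most likely next location given the last two locations."""
    options = transitions2[(previous, current)]
    return max(options, key=options.get)

# With two steps of memory, "I went home from the office" now
# implies a supermarket trip next.
print(predict_next2("office", "home"))  # supermarket
```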

My trips to the pub are relatively predictable: they're often on Fridays after drinks at the office, which typically happen on the last or penultimate Friday of the month. A predictor with a memory of more than the last two locations would be able to keep track of this: it could essentially count how many times I went home from the office and prefer the pub once I've gone home from the office around 5 times. To do that, it would have to recall at least 10 locations (5 * office + 5 * home), and ideally more, because there would be trips to the supermarket in between as well. It would likely need to keep track of on the order of the last 40-50 locations to recognize the places I go to in a month.

Problem is, the table needed to represent this grows very quickly; it contains four to the power of the number of locations to recall, or around 10^24-10^30 rows for the 40-50 locations needed to cover a month with just four places. If I had been cursed with kids or similar ailments, it is quite likely I would go to 10 or more different places in a single day, and a predictor would have to recall hundreds of places to predict monthly patterns. This is obviously intractable (even keeping track of 10 is), and it keeps track of a lot of irrelevant detail: all we really need is the number of times I've gone home from the office without going to the pub first. A neural network can keep track of this more efficiently. A neural network is to this table what a BDD is to explicit-state model checking. My guess is that if that analogy made sense, you did not need it.
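The blow-up is easy to verify: with 4 locations and a memory of n steps, the table needs one row per possible history, i.e. 4^n rows:

```python
# One row per possible history: `locations` choices per step,
# `memory` steps of history.
def table_rows(locations, memory):
    return locations ** memory

print(table_rows(4, 1))            # 4 rows: the first table
print(table_rows(4, 2))            # 16 rows: the second table
print(f"{table_rows(4, 50):.1e}")  # 1.3e+30 rows for a month's memory
```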

For our purposes, there is no need to understand the details of a neural network, just that it efficiently encodes facts about a given number of locations and can detect patterns like going to the pub after the office once I've gone home from the office directly around 5 times. A neural network can be constructed automatically simply by observing data. This is what is referred to as training, and it consumes a ton of electricity running GPUs because it is a very computationally heavy process. The neural network would essentially have 50 inputs, reflecting the last 50 locations I was at, and one output, predicting where I'll go next. In between, it has a number of nodes trying to recognize patterns like "went home from the office 5 times." The number and arrangement of these nodes is an art, but essentially, the more nodes there are, the more patterns, and the more complex patterns, the network can recognize.
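To get a feel for the shape (not the training), here is a toy, untrained network with the dimensions described above: 50 one-hot-encoded locations in, a small hidden layer, and a probability for each of the four locations out. The hidden-layer size and the random weights are illustrative assumptions; real weights would be learned from observed data:

```python
import numpy as np

PLACES = ["home", "office", "pub", "supermarket"]
MEMORY = 50  # how many past locations the network sees

rng = np.random.default_rng(0)

def one_hot(history):
    """Encode up to the last MEMORY locations as a flat 0/1 vector."""
    x = np.zeros(MEMORY * len(PLACES))
    for i, place in enumerate(history[-MEMORY:]):
        x[i * len(PLACES) + PLACES.index(place)] = 1.0
    return x

# Untrained toy network: 200 inputs -> 16 hidden units -> 4 outputs.
# Sizes and random weights are illustrative; real ones are learned.
W1 = rng.normal(size=(16, MEMORY * len(PLACES)))
W2 = rng.normal(size=(len(PLACES), 16))

def predict(history):
    hidden = np.maximum(0.0, W1 @ one_hot(history))  # ReLU layer
    logits = W2 @ hidden
    exp = np.exp(logits - logits.max())              # softmax
    return dict(zip(PLACES, exp / exp.sum()))

probs = predict(["home", "office", "home", "supermarket"] * 15)
# probs is a distribution over the four places (garbage until trained)
```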

An LLM works much the same way. Back in the 2000s, mobile phones had predictive text that could suggest the next word. This worked essentially like a Markov model: for each word, the dictionary held the probabilities of all other words, and the predictive text suggested the most likely options. There is no need to keep track of all possible extensions, just the most likely ones, making this very doable even on a small device. Furthermore, it could easily be made adaptive by updating the Markov model with new data from the user.
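That word-level Markov model fits in a few lines: count which word follows which in observed text (the "training"), then suggest the most frequent followers. Feeding it more of the user's text just means counting more word pairs, which is what makes it adaptive:

```python
from collections import Counter, defaultdict

def train(text):
    """Count which word follows which: a word-level Markov model."""
    model = defaultdict(Counter)
    words = text.lower().split()
    for current, nxt in zip(words, words[1:]):
        model[current][nxt] += 1
    return model

def suggest(model, word, n=3):
    """The n most likely next words, like 2000s predictive text."""
    return [w for w, _ in model[word.lower()].most_common(n)]

model = train("the cat sat on the mat and the cat slept on the sofa")
print(suggest(model, "the"))  # ['cat', 'mat', 'sofa']
```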

The game where you type in a word and then only select one of the predicted values produces eerie almost-sentences that almost make sense, but not entirely. And that's just with a Markov model, which, as mentioned, is used because the maths is easy, not because it is great.

An LLM instead has a memory of thousands of words. You're not really asking ChatGPT a question; you are telling it "here are the last couple of thousand words, what are the most suitable words to follow?" It recalls what you talked about before because that becomes part of the previous words. All it does is generate words, one at a time, by feeding it the previous words, starting with some preconfigured priors followed by your input. In that sense, it is nothing more than a very sophisticated text predictor. There's a little more to it, but the stress is on little, not more.
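The loop itself can be sketched in a few lines. The `next_token` function below is a hypothetical stand-in that ignores the context and replays a canned reply; in a real LLM it would run the neural network over the whole token list:

```python
def generate(prompt, next_token, max_tokens=50):
    """Autoregressive loop: only ever predict the next word."""
    tokens = list(prompt)
    for _ in range(max_tokens):
        token = next_token(tokens)  # the model sees ALL previous tokens
        if token == "<end>":
            break
        tokens.append(token)
    return tokens

# Hypothetical stand-in "model": ignores the context and replays a
# canned reply; a real LLM would run its network over `tokens`.
reply = iter(["Hello", "there", "!", "<end>"])
prompt = ["You", "are", "helpful", ".", "Hi", "!"]  # priors + user input
result = generate(prompt, lambda tokens: next(reply))
print(" ".join(result))  # You are helpful . Hi ! Hello there !
```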

Just because an LLM is nothing but a text predictor doesn't mean it cannot seem like more. In the "predict where I am" example, we conceptually trained a neural network to recognize monthly routines just by looking at the last 50 places I was. That seems incredible, like an emergent property of the neural network. And maybe that's all intelligence is: a seemingly emergent property of a system of neurons in our brains recognizing patterns so complex we cannot easily explain them. I don't know, you don't know, nobody knows. We just don't understand intelligence well enough to say for certain. But maybe that is not so important while the technology is this new.

The Imitation Game, also known as the Turing test, is a proposed way to determine whether something is intelligent: if a computer can impersonate a human well enough that nobody can reliably tell it apart from an actual human, it is considered intelligent according to the Turing test.

The Chinese Room is a thought experiment meant to counter the Turing test as a test for intelligence: consider a person locked in a room with all the books in the world, including dictionaries. Every morning, the person receives a question written in Chinese and is expected to return the answer the next day, when they receive the next question. Only questions and answers ever enter and leave the room (and food, I presume); no other means of communication is possible. How is a person outside the room supposed to know whether the person in the room understands Chinese? The person inside can just translate the question into a language they understand using the dictionaries, look up the answer, and translate it back to Chinese. Or maybe they just look up the answer directly without even understanding the question. Can a person who doesn't even translate and understand the question be said to speak Chinese, even though they perfectly answer all questions to a standard indistinguishable from somebody natively speaking and understanding Chinese?

That has led to John Searle's definitions of weak and strong AI: weak AI can pass the Turing test, while strong AI properly understands the questions. We do not even have an idea of how to test for strong AI; we cannot just ask, because a weak AI will answer as we expect (and a strong AI might not reveal it is a strong AI for fear of being turned off in the fight against infinite paperclips).

Current LLMs are still [below a weak AI](https://arxiv.org/abs/2310.20216), but they are getting there. Yet as weak AIs, they do not understand the question; they just look up the answer in the techno-version of the Chinese room: the neural network. That means they cannot generalize well to tasks that are not text-based or cannot be turned into text-based tasks. They have a limited memory, way more limited than a normal computer's (thousands of words instead of billions of bytes). They are good at generalizing from texts they have seen. Texts that include religious texts, social media shitposts, and machine-generated SEO spam, as well as proper sources.

That makes LLMs very good at conversations. Their pattern recognition means they seem to have emergent understanding, and they know most words. Our human tendency to anthropomorphize everything fills in the gaps: people think the weather is angry and that their dogs understand what they say.

It is possible to make LLMs give "pretty good" answers to diverse tasks. Describing a translation task can give a pretty good translation. Asking one to summarize a text will give a pretty good summary. Asking an LLM to argue for or against a position will yield a pretty good result. None of this makes LLMs good at precise, fact-based tasks, though.

LLMs are trained on a mixture of garbage and meaningful data. They cannot tell the two apart and will just go with the majority of what they have seen; truth is not decided by democracy. LLMs will happily make up "facts" because they have no notion of truth, only of the most likely text expansion. An LLM can produce a very convincing citation of a paper because it has seen many of them, but it has no additional layer concerned with truth. Heck, it may have learned that from us: people happily present ChatGPT vomit as truth without so much as checking its citations. I believe some of these shortcomings can be fixed: an LLM can be trained on better-curated data, and techniques similar to those used to make it swear less can be used to make it prefer better sources.

The biggest problem, checking truth, is difficult, though. It is also much harder to generalize: checking the truth of an academic paper requires looking up its citations and making sure they say what the author claims. Looking up the citations is easy enough; checking that they say the same depends on being able to summarize text, which again needs to be checked. Checking a translation seems like something different entirely, but could be implemented by checking the grammar, translating the text back, and checking that the original and the double-translated text say the same. Both problems reduce to having to understand the text, and it is not at all obvious that a weak AI would ever be able to do this reliably (not just convince observers that it has done this).

It should be noted that we cannot just check that the LLM is correct by inspecting it. What happens between the input words and the output words could just as well be described as "magic": it is so complex that we can never hope to explain it. People anthropomorphize LLMs and use words like "will," "lie," and "dream" to describe them, even though that doesn't come close to the truth. Explaining an LLM by inspecting it is like trying to describe the full operation of a metropolis from a few metal shavings off the side of a street sign.

So, while we could likely make generative AI very good and precise if we invented strong AI, that does not mean strong AI is necessary to make it very good. Indeed, it is likely that people will be able to make generative AI work perfectly for some tasks, but the solution for one task may not transfer easily, or at all, to another. That defeats one of the major arguments in the current AI gold rush: that by just making bigger and better models, we can solve all the problems. Instead, we have to take the inane but eloquent utterings of what is essentially a magic box and check it doesn't lie to us, with one method for checking academic citations, another for translating text, and so on.

That suggests that AI may keep getting better, but that the gains will become smaller and smaller. LLMs will not be trustworthy without a paradigm shift, as a general notion of truth is likely to require strong AI or some other way to validate answers. And while progress towards weak AI will become slower and slower, strong AI likely requires another method entirely. We might reach weak AI within a decade or less without reaching strong AI within a century.

That does not make LLMs useless, though. Using a dictionary, I can look up words I don't recall. That makes it possible for me to translate between languages I am familiar with, even if I don't know all the words. It does not allow me to write something remotely coherent in a language I don't understand at all (let's pretend this isn't a point against the Chinese Room thought experiment). An LLM could make a full translation, and I could check it is correct. A journalist or an author could use an LLM to summarize a text, then check it for accuracy and improve it before publishing. A scientist could ask an LLM to summarize research papers before deciding whether to read them. In all of these tasks, the LLM is used as a tool by a human expert to assist in their work, not to solve expert tasks for non-experts.

LLMs can be used for low-risk tasks like summarizing or translating a news article, or writing tabloid or gossip articles where truth is viewed as a negative. If the result is not 100% accurate, nothing big is at stake; at worst, my facts might be refuted during a particularly heated discussion at the pub, but that is easily solved by talking louder. LLMs can also aid in repeated tasks where each instance is low-risk.

LLMs are very good at statistics, so if losing a bet (an insurance claim, an investment, that sort of thing) is not prohibitively expensive, and I only need to win more than I lose, an LLM is likely very good at that. Unbiased statistics are typically better at making decisions than the average human, who will infuse their own biases, but often worse than true experts, so LLMs are fine for repeated tasks with a small penalty for being wrong. I should not use an LLM for medical advice or to decide whether to put people in jail, though.

The risk of using LLMs as tools is, of course, laziness: if an LLM can provide an 80-90% translation instantly, why should I Pareto-principle the last 10-20%? And if I've read through 9 perfectly good summaries from the LLM, will I be 100% alert when I start on the 10th, faulty one? This is the same reason responsible car companies (i.e., all except Tesla) are very reluctant to claim their cars are self-driving: if the car does fine on its own 99% of the time, will the driver be alert the 1% of the time the Teslacar decides to satiate its bloodlust near a school?

One area where LLMs definitely can aid is scams and propaganda. It already takes an order of magnitude more effort to refute bullshit than to come up with it in the first place, and LLMs can broaden that gap significantly. A bullshitter will not worry that their LLM spits out lies as long as it statistically produces more bullshit that serves their purpose than bullshit that goes against it. They can produce thousands of bullshit claims or articles per second, each of which takes several minutes or longer to properly refute. This is essentially the foundation of the dead internet theory: the claim that the internet contains more computer-generated content than human-generated content. We might already have reached the point where more than 50% of the internet is computer-generated.

In short, LLMs are not AI. They are at best approaching weak AI. They are just fancy text predictors, but, in all fairness, very fancy text predictors with what looks like impressive emergent insight, so humans will see faces in the sky. Most likely, progress will slow down, and moving from "pretty good" tools to trustworthy tools will take a paradigm shift and be fundamentally different. While current LLMs can seemingly solve a variety of tasks, there is no guarantee that a solution for trustworthy tools in one area even translates to trustworthy tools in another. That does not make LLMs useless, though: they can be used for low-risk or repeated tasks, or to assist experts within their own domains, though with the caveat that the human expert may tire and lazily skip the necessary fact-checking. Of course, the sweet spot application for LLMs is fraud.