Can Transformers Solve Everything?

Harys Dalvi

October 2024


Transformers are best known for their applications in natural language processing. They were originally designed for translating between languages,[4] and are now most famous for their use in large language models like ChatGPT (Generative Pre-trained Transformer).

But since their introduction, transformers have been applied to ever more tasks, with great results. These include image recognition,[1] reinforcement learning,[2] and even weather prediction.[3]

Even the seemingly specific task of language generation with transformers has a number of surprises. Large language models have emergent properties that feel more intelligent than just predicting the next word. For example, they may know various facts about the world, or replicate nuances of a person's style of speech.

The success of transformers has led some people to ask whether transformers can do everything. If transformers generalize to so many tasks, is there any reason not to use a transformer?

Clearly, there is still a case for other machine learning models and, as is often forgotten these days, non-machine learning models and human intellect. But transformers do have a number of unique properties, and have shown incredible results so far. There is also a considerable mathematical and empirical basis for why we should expect this success to continue.

The real question, then, isn't “can transformers solve everything?” Instead, it's “why shouldn't they solve everything?” There are a few reasons why not:

- Computational constraints: not enough compute or energy to train and run the model.
- Data constraints: not enough high-quality data for the task.
- Algorithmic constraints: a transformer is the wrong algorithm for the problem.

In this article we'll look at all these constraints, including a cool demo comparing transformers to the classical fast Fourier transform algorithm for time series.

Are transformers the one architecture to rule them all? A depiction of J. R. R. Tolkien's One Ring from Peter Jackson's films. Image source: Peter J. Yost, CC BY-SA 4.0.

This Isn't the First Time

On Kaggle, XGBoost tends to win competitions on structured data, while various kinds of neural networks dominate unstructured data competitions.[5] For quite a while now, people have been asking if model X is the model to end all models, a model that can solve everything.

Surprisingly, these wild claims are actually backed up by solid math. Most striking is the universal approximation theorem, which states that neural networks can approximate any continuous function to any degree of accuracy: you just need enough neurons and nonlinear activation functions. In fact, you can even do this with a single hidden layer, given enough neurons.[6] This idea goes back at least to 1989.[7]
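
To make this concrete, here's a tiny sketch (my own, not from any of the papers above) of a one-hidden-layer network in PyTorch fitting an arbitrary smooth function. The target function, layer width, and training settings are all made up for illustration.

```python
# Illustrative only: a single hidden layer with a nonlinear activation,
# given enough neurons, can fit an arbitrary smooth 1-D function.
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.linspace(-3, 3, 500).unsqueeze(1)
y = torch.sin(2 * x) + 0.3 * x ** 2      # some arbitrary target function

model = nn.Sequential(
    nn.Linear(1, 256),   # one hidden layer, "enough neurons"
    nn.Tanh(),           # nonlinear activation
    nn.Linear(256, 1),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for step in range(2000):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()

print(f"final MSE: {loss.item():.5f}")   # approaches zero with enough width and steps
```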

If we go back even further, we find the Church-Turing thesis from the 1930s. Informally, this is the idea that anything that can be computed can also be computed by a Turing machine.[8]

This holds up for transformers too. It turns out that, subject to some constraints, transformers are theoretically capable of approximating any sequence-to-sequence function.[9] This means that with a Turing machine in Python and a neural network in PyTorch, you can theoretically do any possible computation.

If that's the case, why haven't we computed everything yet?

This goes back to the constraints: computational constraints (not enough computational power), data constraints (not enough high quality data), and algorithmic constraints (using the wrong algorithm for a problem).

Scaling

Data, compute, and energy constraints on transformers, while not the same, are intimately tied together. Even if we have enough data to train a transformer model on a task, we might not have enough compute or energy to complete the training. All of these fall under the problem of scaling.

As discussed before, theoretically transformers can do many, many things. But getting transformers to do these things in practice generally requires scale. Often that's too expensive, and it's a better idea to use simpler models instead. Let's look at the future of scaling transformers to see exactly when transformers are a better option.

Large Companies

ChatGPT isn't a crazy algorithmic jump over GPT-2, whose outputs were far less coherent. The difference is largely a difference of scale: more layers, more parameters, and more training.

Therefore, some argue that with enough scaling of transformers, we will reach a general transformer model that can do just about any thinking a human can: this would be artificial general intelligence (AGI). This could greatly reduce the need for other algorithms and, concerningly, possibly humans as well. We just need more data, more compute, and more time; or so the argument goes.

As for data, we probably still have a while to go. It's estimated that large language models (LLMs) have only trained on about 1/30 of all data on the web, which is massive, but leaves a lot of room to expand. Additionally, it turns out that training LLMs on data that they themselves generated (synthetic data) can improve performance. So by letting LLMs improve themselves, we might have even more data than these figures would suggest.[12]

Compute itself also likely won't be a limiting factor for the largest companies. GPUs currently take up only a small fraction of all wafer production at TSMC (Taiwan Semiconductor Manufacturing Company), meaning there is plenty of manufacturing capacity left over.[12] As demand increases, GPU production can be scaled up with it.

Instead, energy could be the bottleneck. If large transformer models scale 5000x by 2030, as projected, the power demand just for a single training run would be around 6 gigawatts. This is both extremely expensive and bad for the environment. Companies are looking into nuclear power to get around this, but there are various obstacles, especially regulatory concerns.[12]

Even when not training, running AI models will require significant power. Technologies such as Chain-of-Thought (CoT) reasoning in the new OpenAI o1 model could increase inference costs even further.

Smaller Companies

Smaller companies likely won't have the resources to train LLMs from scratch, and they also won't need them. If a small company needs to access a powerful transformer model from a large company, they can either pay for queries or host an open source model locally.

So far, these large transformer models have mostly been LLMs. But in the future, if transformers really can solve everything, we might see similarly large transformer models for other kinds of tasks.

Some domain-specific transformers might also be less intensive to train, so startups can build and sell their own in-house. Already we see companies like Atmo using deep learning for weather forecasting. While transformers for language modeling are very intensive and slow, Atmo's model is actually faster and more accurate than the corresponding atmospheric physics simulations. If transformers can generalize to a wide range of domains outside language, we might see startups not only using large companies' LLMs, but also building their own niche and innovative transformers.

Domain-Specific Applications

Transformers can even do things we don't usually do with machine learning, like add and subtract numbers[10] and implement hashing algorithms.[11] If we use transformers to achieve some sort of AGI (whatever that means), then naturally we could do a wide range of tasks like these. Would this render other domain-specific models obsolete?

The transformer state of the art in arithmetic seems to be 99% accuracy on 100-digit numbers.[10] But there's an even better algorithm for adding and subtracting numbers. It takes minimal compute, requires no training data, works with numbers of any size, and has 100% accuracy. It's called... adding digits and carrying the extras.
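
In case the comparison seems unfair, here is that algorithm sketched in a few lines of Python, applied to numbers of roughly 100 digits:

```python
# Grade-school addition on decimal strings: exact for any length, no training data.
def add_digits(a: str, b: str) -> str:
    n = max(len(a), len(b))
    a, b = a.zfill(n), b.zfill(n)         # pad to the same length
    carry, digits = 0, []
    for da, db in zip(reversed(a), reversed(b)):
        total = int(da) + int(db) + carry
        digits.append(str(total % 10))    # write down the last digit
        carry = total // 10               # carry the extra
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits))

assert add_digits("999", "1") == "1000"
# Two numbers of roughly 100 digits each, added with 100% accuracy:
assert add_digits(str(2 ** 333), str(3 ** 210)) == str(2 ** 333 + 3 ** 210)
```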

There are many domains where speed, interpretability, and 100% accuracy really matter. A transformer will always fall short on at least one of these, even if it could technically produce a decent result. These include arithmetic, cryptography, and mathematical proof verification.

So it seems these areas are safe from the influence of transformers, right?

Not quite. It's true that transformer models are only the best tool for the job in a certain subset of cases. However, one of these cases may be the task of determining when and where to carry out more traditional calculations! For example, transformers are much better than other models at coding, so in theory they could simply write programs to solve tasks that transformers themselves are ill-suited for.

With the correct setup, they can do this in a collaborative feedback loop with more traditional tools. Consider Google DeepMind's AlphaProof. This system combines a pretrained language model with Lean, a proof assistant that can verify mathematical proofs. So rather than just stochastically spitting out a proof, the language model can check that its proof is correct and adjust as necessary. AlphaProof achieved silver-medal-level performance on problems from the International Mathematical Olympiad, one of the most difficult and prestigious mathematics competitions.
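
To get a feel for what "verify mathematical proofs" means here, below is a toy Lean 4 statement (my own illustration, nothing to do with AlphaProof's actual proofs). Lean either accepts the proof or rejects it, so there is no room for an argument that merely sounds plausible.

```lean
-- A toy illustration: the classic syllogism, stated and proved in Lean 4.
-- The proof checker either accepts this or it doesn't; no partial credit.
theorem socrates_is_mortal
    (Person : Type) (Man Mortal : Person → Prop) (socrates : Person)
    (all_men_mortal : ∀ p, Man p → Mortal p)
    (socrates_is_man : Man socrates) :
    Mortal socrates :=
  all_men_mortal socrates socrates_is_man
```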

Algorithmic Constraints

What do we mean by algorithmic constraints? In general, this is the idea that a transformer trained on some data might not be the best algorithm we have to solve a given problem.

In fact, this might be a computational constraint in disguise: maybe a transformer can technically solve a problem, but the amount of data and compute required is far more than with a more specialized algorithm. Let's take a look at one such case.

Demonstration: Bad Algorithm means More Compute

We know transformers are expensive. But how much more expensive is a transformer, really? Let's test this out by simulating a noisy time series and using two methods to pick out the signal: the fast Fourier transform (FFT), a well-known tool for this job, and a transformer model.

We'll use 5000 total data points of the signal $$\sin(x) + \frac{1}{5}\cos\left(\tfrac{11}{13}x\right) + \frac{1}{9}\sin\left(\tfrac{17}{37}x-\tfrac{\pi}{4}\right),$$ plus some Gaussian noise with standard deviation \(\frac{1}{7}\). We'll split this into 90% training data and 10% test data.
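
Here's roughly how that dataset can be generated. This is a sketch consistent with the description above; the sampling range and variable names are my own choices, and the exact code is in the repo linked at the end.

```python
# Sketch of the data setup: 5000 points of the three-sinusoid signal plus
# Gaussian noise (std 1/7), split 90% train / 10% test. The x-range here
# is an arbitrary choice for illustration.
import numpy as np

rng = np.random.default_rng(0)
N = 5000
x = np.linspace(0, 100 * np.pi, N)

clean = (np.sin(x)
         + (1 / 5) * np.cos((11 / 13) * x)
         + (1 / 9) * np.sin((17 / 37) * x - np.pi / 4))
noisy = clean + rng.normal(0, 1 / 7, size=N)

split = int(0.9 * N)
train, test = noisy[:split], noisy[split:]
```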

Our transformer will be decoder-only, with an input size of 25 data points, a hidden dimension of 8, a feedforward dimension of 4, 1 attention head, and 1 layer. We'll train for 1 epoch with a batch size of 128, using the Adam optimizer with a learning rate of 0.1. Writing that out feels ridiculous for a simple time series task like extracting a signal, but here we are.
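
A rough PyTorch sketch of that setup is below. I'm implementing "decoder-only" as a single encoder layer with a causal mask and adding a simple learned positional embedding; the exact architecture and training code are in the repo linked at the end.

```python
# Rough sketch of the model and training setup described above: hidden dim 8,
# feedforward dim 4, 1 head, 1 layer, window of 25 points predicting the next
# point, Adam at lr 0.1, batch size 128, 1 epoch. Details are approximations.
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

WINDOW = 25

class TinyTransformer(nn.Module):
    def __init__(self, d_model=8, n_heads=1, d_ff=4, n_layers=1):
        super().__init__()
        self.embed = nn.Linear(1, d_model)
        self.pos = nn.Parameter(torch.zeros(1, WINDOW, d_model))  # learned positions
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=d_ff, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 1)

    def forward(self, x):  # x: (batch, WINDOW, 1)
        # Causal mask makes this effectively decoder-only.
        causal = torch.triu(
            torch.full((x.size(1), x.size(1)), float("-inf")), diagonal=1)
        h = self.encoder(self.embed(x) + self.pos[:, : x.size(1)], mask=causal)
        return self.head(h[:, -1])  # predict the point after the window

def make_windows(series):
    # series: 1-D NumPy array, e.g. `train` from the data sketch above
    xs = np.stack([series[i:i + WINDOW] for i in range(len(series) - WINDOW)])
    ys = series[WINDOW:]
    return (torch.tensor(xs, dtype=torch.float32).unsqueeze(-1),
            torch.tensor(ys, dtype=torch.float32).unsqueeze(-1))

model = TinyTransformer()
opt = torch.optim.Adam(model.parameters(), lr=0.1)   # later dropped to 0.01
loss_fn = nn.MSELoss()

xs, ys = make_windows(train)
loader = DataLoader(TensorDataset(xs, ys), batch_size=128, shuffle=True)

for epoch in range(1):                               # later increased to 20
    for xb, yb in loader:
        opt.zero_grad()
        loss_fn(model(xb), yb).backward()
        opt.step()
```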

Wow... the FFT did OK, but the transformer is absolutely horrendous! Let's try decreasing the learning rate to 0.01.

Getting a little better. Let's try training for 20 epochs instead of 1.

The transformer's predicted frequency and amplitude are a little too low, and it's not as smooth as the FFT solution, but now at least it's got the spirit. As for quantitative performance, the FFT had a root mean square error of 0.24, while the transformer had 0.88. The FFT is doing much better, especially considering that due to our 1/7 random noise, we wouldn't expect to get below 0.14.
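
For reference, the error figures above are root mean square errors, computed along these lines (a sketch; the exact evaluation target is in the linked repo):

```python
# RMSE helper for the comparison above (a sketch; see the repo for the exact
# evaluation). With observation noise of std 1/7, roughly 0.14 is about the
# best score even a perfect reconstruction could achieve against noisy points.
import numpy as np

def rmse(pred, target):
    pred, target = np.asarray(pred), np.asarray(target)
    return float(np.sqrt(np.mean((pred - target) ** 2)))
```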

Of course, the FFT also wins on training time. The FFT took just 0.11 seconds to compute on all training data, while the transformer took 18.9 seconds to train all 20 epochs. All that for worse performance.

The FFT has another benefit too: interpretability. We can look inside and see the amplitudes of all the frequencies that the FFT picked up.
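
With NumPy, "looking inside" is just a matter of reading off the amplitude of each frequency bin. The sketch below (my own illustration; the demo's exact filtering may differ) keeps the strongest components and prints their frequencies and amplitudes:

```python
# FFT denoising sketch: keep only the strongest frequency components, and
# read off each component's frequency and amplitude directly.
import numpy as np

def fft_denoise(series, dt, keep=3):
    coeffs = np.fft.rfft(series)
    freqs = np.fft.rfftfreq(len(series), d=dt)   # in cycles per x-unit
    amps = 2 * np.abs(coeffs) / len(series)      # approximate sinusoid amplitudes

    top = np.argsort(amps)[-keep:]               # indices of the strongest components
    for i in sorted(top):
        print(f"frequency {freqs[i]:.4f}, amplitude {amps[i]:.3f}")

    filtered = np.zeros_like(coeffs)
    filtered[top] = coeffs[top]                  # zero out everything else
    return np.fft.irfft(filtered, n=len(series))

# e.g. fft_denoise(train, dt=x[1] - x[0]) using the arrays from the data sketch
```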

If we look back to the real equation, this is pretty spot-on.

In the real world you might not have a dataset like this where a traditional non-machine learning model is obviously the perfect choice rather than machine learning or a transformer. Machine learning shines where we have the data, but we don't even know where to start when it comes to coding an algorithm. And neural networks like transformers shine even more in cases of unstructured data where we don't have good ideas to compute our own features. But hopefully this toy example demonstrates the universal approximation theorem, as well as why it isn't always a good guide in practice. To match or exceed the FFT performance with a transformer, we would need a lot more data and compute. Just because transformers can do everything doesn't mean they should.

Transformers aren't the final frontier either. There are exciting architectures like Mamba on the horizon that could one day replace transformers.[13] Most likely, they will still not replace traditional methods, for similar reasons to transformers.

Bad Data means Bad Algorithm

Just as algorithmic constraints are sometimes compute constraints in disguise, we can also look at some algorithmic constraints as being data constraints in disguise.

What I mean is this: if we want to solve a problem with transformers, or any sort of machine learning, we usually start with a dataset. But no matter how good we get at inference on that dataset, even if we end up with a cool machine learning algorithm, we may not have actually solved the problem we had to begin with. In other words, winning a Kaggle competition for cancer detection does not mean you cured cancer.

Going with the Kaggle example, why is this the case? The answer, of course, is that detecting cancer is just one small part of curing cancer. Because of the targets and labels of our dataset, our machine learning algorithm will always be an algorithm designed only to detect cancer, no matter how much we refine it with more data.

This is a problem not just with transformers but with all of machine learning. There's just one exception: if we were to train a truly general machine learning algorithm, one that can take any input and produce an appropriate output, this would not apply. Such an algorithm would be able to detect cancer and cure it.

The question is whether this applies to LLMs. Is predicting the next word really a proxy for intelligence and creative thinking, even up to the level of curing cancer? We've already seen that the simple next-word-prediction task seems to capture some amount of intelligence. But it remains to be seen how far this will go.

The internet does contain some data that requires reasoning: you might see a sentence like “Socrates is a man. All men are mortal. Therefore...” and the LLM is trained to continue it. Human fine-tuning can refine this capability further than basic syllogisms. But maybe solving advanced reasoning through next-word-prediction will turn out to be intractable, and something more is required.

Conclusion

According to the universal approximation theorem, neural networks can approximate any continuous function to any degree of accuracy. This means that in theory, yes, transformers can solve everything, including time series. They might even work in other areas, like images.

However, they often come at a large computational cost, and might require more data than we have access to. Even in cases where a transformer could work, a traditional model often comes with both better performance and lower cost.

Transformers are still a powerful model. While they are mostly associated with LLMs, startups could do well to build more domain-specific transformers as well.

In general, given enough data, transformers or other neural networks will eventually do a good job matching their dataset. But when training a transformer, or any machine learning model, sometimes good performance on the dataset isn't really a success. You may have to ask whether solving this dataset is really solving a problem for people. Just like solving a breast cancer dataset on Kaggle won't end breast cancer, it remains to be seen whether solving the next-word-prediction task will solve AGI. There are arguments both for and against the idea that it will.

Even if next-word-prediction doesn't solve AGI, it's at least a useful tool for coding. When combined with other systems, transformers can solve tasks more robustly, even mathematical proofs that require perfect rigor. But the use of a more traditional system is critical here.

So yes, transformers can solve everything. But they probably shouldn't.

References

The GitHub for the transformers vs FFT demo is at crackalamoo/blog-demos.

  1. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit & Neil Houlsby, Google Brain, 2020)
  2. Decision Transformer: Reinforcement Learning via Sequence Modeling (Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas & Igor Mordatch, 2021)
  3. Scaling transformer neural networks for skillful and reliable medium-range weather forecasting (Tung Nguyen, Rohan Shah, Hritik Bansal, Troy Arcomano, Sandeep Madireddy, Romit Maulik, Veerabhadra Kotamarthi, Ian Foster & Aditya Grover, 2023)
  4. Attention Is All You Need (Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser & Illia Polosukhin, 2017)
  5. Lessons from 2 Million Machine Learning Models on Kaggle (Vasyl Harasymiv, KDnuggets, 2015)
  6. A visual proof that neural nets can compute any function (Michael A. Nielsen, Neural Networks and Deep Learning, 2019)
  7. Multilayer feedforward networks are universal approximators (Kurt Hornik, Maxwell Stinchcombe & Halbert White, Neural Networks, 1989)
  8. The Church-Turing Thesis (Stanford Encyclopedia of Philosophy, 1997–2023)
  9. Are Transformers universal approximators of sequence-to-sequence functions? (Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi & Sanjiv Kumar, 2020)
  10. Transformers Can Do Arithmetic with the Right Embeddings (Sean McLeish, Arpit Bansal, Alex Stein, Neel Jain, John Kirchenbauer, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, Jonas Geiping, Avi Schwarzschild & Tom Goldstein, 2024)
  11. Implementing an SHA transformer by hand (Andrew Gritsevskiy)
  12. Can AI Scaling Continue Through 2030? (Jaime Sevilla, Tamay Besiroglu, Ben Cottier, Josh You, Edu Roldán, Pablo Villalobos & Ege Erdil, Epoch AI, 2024)
  13. Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Albert Gu & Tri Dao, 2023)