Intro


Nowadays, people encounter large models, such as LLMs, in their daily lives. Debates about which model deserves the title of “strongest” are everywhere. AI startups are striving to improve their models by any means necessary, including scaling up the parameter count. Having worked in deep learning for more than a year, I have grown weary of this competition over parameter count (hereinafter NoP, number of parameters). Models like Llama 405B are incredibly powerful, yet also incredibly slow and uninspiring, and I have not encountered a model since that astonished me the way GPT did. In a lecture at Shanghai Jiao Tong University, Li Mu, the Chief Deep Learning Scientist of Amazon, argued that model NoP will eventually plateau due to hardware limitations, capping at around 500B. However, there will never be a lack of ways to improve. One practical approach, and the subject of this essay, is multimodality.

What Does Multimodality Bring?


I always treat LLMs politely, jokingly telling my friends that if AI ever gets out of control, it might treat me better because of how I have behaved towards it. Of course, that is not the real reason. I do not believe there will ever be a day when AI tries to enslave humans. Consider instead how an LLM processes information. A plausible explanation is that the model learns associations and patterns of reasoning from dialogue, so when it receives an input, it generates an output based on what it has learned. The training data contains a great many dialogues, and in those dialogues responses tend to be better when the corresponding input is more polite. My politeness is therefore reasonable: the model may give a more useful output when the input is more respectful.
I am not suggesting that everyone should treat models with full respect in order to receive better answers. In fact, in most cases, variations in the input may not make a noticeable difference. What I really want to show is that, in some respects, models behave more like humans than we might expect. Kaiming He's Residual Network drew lessons from how the human brain handles retinal input; in other words, the structure of the human brain has benefited deep learning research considerably. Indeed, the concept of a “neural network” is itself an imitation of the human brain, as its name suggests. Seen this way, a model's human-like behavior is understandable, even predictable. Some may object that the logic “similar structures cause similar behaviors” is silly. If you think so, you are right. However, modern deep learning is a field full of uncertainty, and deep learning scientists are alchemists: many widely used techniques still lack a solid theoretical foundation.