Evolved to recognize human emotions? GPT-4o, now free for everyone

"Sometimes I observe people, and I try to put myself in their shoes, imagining how deeply they love others or what kind of heartbreaks they have experienced." — The movie plot of the science fiction film "Her" seems to be on the verge of becoming a reality.

01

The Explosive OpenAI Launch Event

A decade ago, the film "Her" won Best Original Screenplay at the 86th Academy Awards in 2014. It tells the story of a lonely writer who falls in love with the AI voice assistant on his phone. In the film, the AI, named Samantha, has a husky, alluring voice; she is witty, humorous, and empathetic, accompanying the male protagonist at all times and gradually becoming an indispensable part of his life.

Fast forward ten years to OpenAI's spring launch event: with the arrival of the new GPT-4o model, Samantha has effectively become a reality. This upgraded version of ChatGPT can not only chat with you as naturally as Samantha but can also observe and interpret your emotions through the phone's camera.

At 1 a.m. Beijing time on Tuesday, OpenAI unveiled its latest multimodal large model, GPT-4o (where "o" stands for Omni, meaning all-encompassing). This "all-encompassing" large model has the capability to process text, audio, and images. Compared to previous generations, it has added voice functionality and operates at a faster speed.


The OpenAI launch event was very brief, lasting only 26 minutes, but the evolution of ChatGPT is truly awe-inspiring.

Although GPT-5 did not arrive as expected, OpenAI's latest flagship model, GPT-4o, has already produced a "qualitative change" in human-computer interaction. According to the official introduction, the "o" in 4o stands for "omni," meaning all-encompassing: this version of GPT fully integrates text, vision, and audio capabilities and can accept any combination of them as input and output. Its fastest response to audio input is 232 milliseconds, with an average of 320 milliseconds, matching human reaction speed in conversation.

It is understood that GPT-4o represents a step toward more natural human-computer interaction: it accepts any combination of text, audio, and images as input and generates any combination of text, audio, and images as output. "Compared to existing models, GPT-4o is particularly excellent at understanding images and audio."

Before GPT-4o, when users conversed with ChatGPT in voice mode, the average latency was 2.8 seconds with GPT-3.5 and 5.4 seconds with GPT-4. Because audio was processed through a separate pipeline, a significant amount of information was also lost along the way: GPT-4 could not directly perceive intonation, distinguish speakers, or hear background noise, nor could it output laughter, singing, or emotional expression.
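That latency and information loss came from chaining separate models together (speech-to-text, a text-only GPT, then text-to-speech) rather than from a single end-to-end model. Below is a minimal sketch of such a chained pipeline, assuming the OpenAI Python SDK, an API key in the environment, and placeholder file names; it illustrates the approach, not OpenAI's actual internals.

```python
# Illustrative sketch of the legacy chained voice-mode pipeline (not OpenAI's exact internals).
# Each hop adds latency, and the transcription step drops tone, speaker identity, and background sound.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# 1) Speech -> text with Whisper (intonation and emotion are lost here)
with open("question.mp3", "rb") as audio_file:  # placeholder file name
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio_file)

# 2) Text -> text with a text-only chat model
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)

# 3) Text -> speech with a separate TTS model (a fixed voice, little expressive control)
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply.choices[0].message.content,
)
speech.write_to_file("answer.mp3")
```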

In contrast, GPT-4o can respond to audio input in as little as 232 milliseconds, close to human reaction time in conversation. In a pre-recorded video demonstration, two executives showed that the model could recognize "anxiety" from the sound of rapid breathing and guide the speaker to take deep breaths. It can also change its tone of voice on request.

On audio, GPT-4o's automatic speech recognition (ASR) also outperforms OpenAI's own Whisper speech-recognition model on error-rate benchmarks, where lower is better.

More importantly, GPT-4o's visual understanding capabilities have achieved an overwhelming victory in relevant benchmarks.

02

GPT-4o with Emotional Understanding Capabilities

In contrast to the intense competition ("involution") among today's large models over parameters and performance, GPT-4o has become the focus of the global tech community primarily because of its "emotional understanding" capabilities.

GPT-4o takes a significant step toward understanding human communication, letting users converse with it in something close to natural speech. It handles almost all the quirks of real conversation, such as being interrupted, picking up tone, and even recognizing when it has made a mistake.

During the first live demonstration, the host asked GPT-4o to provide feedback on his breathing technique. He took a deep breath into his phone, and ChatGPT responded humorously, saying, "You're not a vacuum cleaner." It suggested using a slower technique, demonstrating its ability to understand and respond to the nuances of human interaction.

In addition to having a sense of humor, ChatGPT can also change the tone of its responses, conveying "thoughts" with different intonations. Just like in human conversation, you can interrupt its dialogue and correct it, prompting it to react or stop speaking. You can even request it to speak in a certain tone, style, or robot voice.

Moreover, it can even provide translation services. In the live demonstration, two speakers on stage, one speaking English and the other Italian, conversed through GPT-4o's translation. It quickly translated the Italian into English and then seamlessly translated the English response back into Italian.

OpenAI claims that GPT-4o can also detect human emotions. In the demonstration, Zoph held his phone up to his face and asked ChatGPT to describe what he looked like. Initially, GPT referred to a photo he had previously shared and identified him as a "wooden surface." Upon a second attempt, GPT provided a better answer.

GPT noticed the smile on Zoph's face and said, "You look very happy, beaming with joy." Some commentators suggest this demonstration shows that ChatGPT can read human emotions, although it still has some difficulty doing so.

Not only does it perceive tone and state and express the humor associated with human "emotional intelligence," ChatGPT also shows that a conversation can be interrupted at any time and picked up again immediately. In short, in a conversation with OpenAI research lead Mark Chen, it came across almost like a real human, with no awkward pauses or misunderstandings.

In addition to real-time conversation, Mark Chen also guided ChatGPT to show how it renders vocal tone in storytelling. He asked ChatGPT to tell a bedtime story on the theme "Robots in Love" and twice asked it to read in a more "dramatic" way. Listeners could clearly hear the increasingly dramatic delivery, a clear display of its emotional expressiveness.

Executives at OpenAI stated that GPT-4o can interact with codebases, and they demonstrated its ability to draw conclusions from data-analysis charts it was shown, including a global temperature chart. OpenAI announced that the text and image input features of ChatGPT based on GPT-4o will launch this Monday, with voice and video options rolling out in the coming weeks.
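As a rough illustration of the chart-reading ability described above, here is a minimal sketch of sending a chart image to GPT-4o through the public chat completions API; the file name and prompt are placeholders, and the example assumes the OpenAI Python SDK.

```python
# Minimal sketch: asking GPT-4o to interpret a chart image via the chat completions API.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder chart image; any local PNG would do
with open("global_temperature.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What trend does this chart show, and what conclusions can you draw from it?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```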

GPT-4o can be seen as an important advancement in the naturalization of human-computer interaction. Its application potential and operational boundaries are still in the initial stages and require further exploration and experimentation.

OpenAI also demonstrated some applications of GPT-4o in everyday scenarios, from entertainment to education, from social interaction to professional assistance, showing that it can assist humans in various aspects. For example, improving the quality of life for visually impaired individuals, real-time translation, helping to learn new languages, assisting in online meetings or interviews, interacting with pets, playing games, and more.

03

Multimodal Intellectual Performance

Beyond emotion, GPT-4o is a multimodal product.

OpenAI CEO Sam Altman did not appear on stage, but he backed the newly launched GPT-4o "behind the scenes" by posting online, calling it "intelligent, fast, natively multimodal, and the best model ever."

What Altman calls "native multimodality" refers to the integration of text, image, and speech capabilities in a single model. He also posted that developers who want to experiment with GPT-4o will have API access and can build applications with the new model starting Monday, at half the price of GPT-4 Turbo and twice the speed.
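For developers, that access works through the standard chat completions API with the new model identifier. A minimal sketch, assuming the OpenAI Python SDK and an API key in the environment:

```python
# Minimal sketch of calling GPT-4o through the chat completions API.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "In one sentence, what does the 'o' in GPT-4o stand for?"},
    ],
)
print(response.choices[0].message.content)
```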

OpenAI stated, "We have trained a new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network. Since GPT-4o is our first model to combine all these modalities, we are still in the early stages of exploring the capabilities and limitations of this model."

In addition to the features highlighted in the live broadcast, OpenAI's technical documentation shows that GPT-4o's capability list also includes 3D generation, rendering poems as styled images, and transforming photos into cartoon versions, among others.

Although multimodal AI is still in its infancy, several models have begun to emerge. Google's Gemini Ultra model has surpassed GPT-4 on the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark, demonstrating the potential of multimodal models. To stay competitive, more developers of large language models will follow suit in building multimodal capabilities.

Furthermore, multimodal AI is expected to unlock new business opportunities, such as the application of Artera in healthcare, Google's integration of Gemini into search, Ghost Autonomy's exploration in the field of autonomous driving, and Meta's application of it in consumer devices like smart glasses.

This year, AI large-model platforms worldwide have continued to iterate and upgrade, including overseas models such as Sora and Llama 3 and domestic ones such as Kimi, Kunlun's Tiangong AI, and Stepping Stars (StepFun). Huatai Securities pointed out that as model capabilities improve, consumer-facing (2C) applications are expected to develop at an accelerated pace. The core issues for 2C applications are product performance and users' willingness to pay. With the optimization of the underlying foundation models, the usability of 2C applications has improved significantly, and their modalities are expanding rapidly.

Multimodality is seen as one of the important trends in the AIGC industry for 2024. The "China AIGC Application Panorama Report" published by Quantum Bits (QbitAI) projects that China's AIGC (generative AI) application market will reach 20 billion yuan in 2024 and a trillion-yuan scale by 2030, with a compound annual growth rate exceeding 30% from 2024 to 2028.

04

When Can We Use GPT-4o?

OpenAI will start rolling out the text and image capabilities of GPT-4o today, emphasizing that free users of ChatGPT will also have access. Prior to this, free users only had access to GPT-3.5, while the GPT-4 model was targeted at paying customers.

According to OpenAI, paying customers will receive up to five times the message capacity limit. Once free users have used up their allotted number of messages, ChatGPT will automatically switch to GPT-3.5.

Currently, the GPT-4o API does not yet include voice functionality. OpenAI has expressed concerns about the risk of abuse and plans to offer new audio capabilities to paying customers in the coming weeks. The multilingual capabilities of GPT-4o have also been upgraded. Its performance on English text and code matches that of GPT-4 Turbo, but its performance on non-English text has significantly improved. At the same time, the API is faster and costs have been reduced by 50%.

The ChatGPT update also includes a new user interface (UI) and a desktop version of ChatGPT for macOS. Users can ask questions to ChatGPT using keyboard shortcuts and discuss directly within the application through screen captures. Mira Murati stated, "We know these models are becoming more complex, but we want the interaction experience to become more natural and simpler, allowing you to focus entirely on collaborating with GPT without worrying about the user interface."

05

Free users of ChatGPT will also have access to the newly released GPT-4o model (previously only GPT-3.5 was available), enabling them to perform operations such as data analysis, image analysis, internet searches, and access to the GPT app store. This also means that developers in the GPT store will face a massive influx of new users.

Of course, paying users will receive a higher message quota (OpenAI says at least five times more), and when free users run out of messages, ChatGPT will automatically switch to GPT-3.5. In addition, OpenAI will bring an improved voice experience based on GPT-4o to Plus users in about a month; for now, the GPT-4o API does not include voice functionality.

Apple users will also get a ChatGPT desktop application designed for macOS, where they can ask ChatGPT questions via a keyboard shortcut and discuss screenshots taken from the desktop. OpenAI has said a Windows version will be released later this year.

06

With GPT-4o launched, it seems OpenAI's competitors can't sit still. Google quickly posted a video on the social media platform X previewing features of its Gemini large model. In the video, the AI model describes what is happening in the camera frame and gives real-time voice feedback, much like what OpenAI had just demonstrated. Google will host its annual I/O developer conference at 1 a.m. Beijing time on Wednesday, where it is expected to showcase a range of AI-related products.

In December last year, Google released version 1.0 of Gemini, claiming it has multimodal interaction capabilities. In the video demonstration, Gemini could perceive human movements in real-time and respond directly with voice. However, it was later revealed that the video was edited, and Google admitted to reducing latency and shortening Gemini's output time for the sake of demonstration effect.

Many people also compare OpenAI's GPT-4o with Apple's AI assistant Siri. According to earlier reports from Bloomberg, Apple is close to reaching an agreement with OpenAI and is finalizing the terms for bringing ChatGPT features to the next-generation iPhone operating system, iOS 18. Apple will hold its Worldwide Developers Conference (WWDC) in June, where it is expected to announce a series of artificial intelligence features.

It is worth mentioning that today's update from OpenAI seems to have brought the long-criticized AI voice assistant back to center stage. In reality, as large models have matured, a large number of AI companies have already moved into this space over the past year and even made a series of commercialization attempts. Their products just don't appear as traditional voice assistants in phones and devices; instead, they are wrapped in the concept of "AI companionship."

Now on TikTok, searching keywords like "AI dating" and "AI companion" turns up a large number of related products, with recommended videos racking up views in the millions. Some combine AI with anime-style cartoon characters, while others use realistic human avatars. Among the most popular at present are Character.ai, CrushOn, Talkie, Replika, and so on.

Compared with ChatGPT, which emphasizes functionality, these AI products emphasize emotional companionship and emotional value, aiming to provide personalized social experiences with language styles closer to real people. Judging from current results, user stickiness for AI companionship products is much higher than for functional AI products. Functional AI products are used for specific needs and scenarios where a solution is sought, whereas the time and energy people invest in AI companionship products becomes an emotional attachment, turning the interaction into a long-term bond.

However, the ones who may truly feel helpless are the startups in the AI chat space.

AI chat turns large models into engines for human imagination. Just as in the movie "Her," the protagonist can chat online with a virtual AI version of Alan Watts, who had been dead for many years.

The leading company in this space, Character.ai, has since launched Group Chat, where users can chat with celebrities such as Napoleon, Musk, or Taylor Swift at the same time.

AI role-playing has become a viable direction for AIGC. Recently, an AI virtual-character chat app, Talkie: Soulful AI (hereinafter Talkie), spent a month in Google Play's top 10 free entertainment apps in New Zealand, the UK, Canada, Australia, the United States, and other regions. The company behind it is MiniMax, the highest-valued startup in China in the first half of this year. In November last year, MiniMax launched a trial product, the AI companion app Glow, which is the predecessor of Talkie.

At the same time as Talkie's launch, Xingye, a domestic AI chat and card-drawing app from MiniMax, was released. It is reasonable to infer that Xingye is the domestic version of Talkie.

What makes Talkie stand out from other AI chat apps is its card mechanism, which has attracted a large number of users who love OCs (original characters), ACG enthusiasts, and people craving companionship and conversation. While chatting naturally with a character, triggering certain topics gives users a chance to draw CG cards, which is also an important way Talkie monetizes.

AI chat products command strong demand and traffic. Even during the period when Glow was pulled from app stores, users went looking for "substitutes," and domestic AI chat products such as X Her, Zhumeng Island, Caiyun Xiaomeng, and Aura AI are all trying to break through with innovative product design.

Zhumeng Island ("Dream Island"), launched by Shanghai-based Yuewen, was originally a feature within the Xiaoxiang Academy app. Its product design is similar to Glow's, including character profiles, opening lines, and avatars on character boards; users can also create their own mini-theater story settings and chat with AI characters. Compared to Glow, it supports a longer context window and better memory.

Yuewen Group is one of MiniMax's partners, and some claim that Dream Island is actually built on MiniMax's API.

However, with the emergence of GPT-4o, which can understand human emotions, whether it's OpenAI stepping in directly or opening up an API interface, it's probably not good news for startups in the AI companion chat race.

If you thought of "Her" (a movie about falling in love with an AI virtual person Samantha) or other AI-related futuristic dystopian movies, you're not alone. Talking to Chat GPT-4o in such a natural way is essentially OpenAI's "Her" moment. Considering it will be offered for free on mobile and desktop applications, many people might soon have their own "Her" moment.

Although not present in the live demonstration, OpenAI CEO Sam Altman made an important summary of the demonstration, saying that GPT-4o feels like the AI in the movies.

He said: "The new voice and video models GPT-40 are the best computer interfaces I've ever used; it feels like AI from the movies. And, for me, it's still a bit surprising how real it is, reaching human-level response speed and performance. The original ChatGPT showed the possibilities of the language interface, but this new thing, GPT-4o, feels fundamentally different; it's fast, intelligent, fun, natural, and practical."