The AI behemoth not only processes text but also comprehends images and audio, going beyond the conventional data digestion of a chatbot.


For the past year, Google has been playing catch-up with OpenAI. The release of ChatGPT marked a significant moment in the AI era, prompting the search giant to hasten its pace. Historically, Google was aggressive in AI research but sluggish in releasing tools to the public. That left room for a nimble startup to make a surprise move, prompting CEO Sundar Pichai to declare a ‘code red.’ The AI chatbot threat also brought founders Sergey Brin and Larry Page out of retirement at Pichai’s request.

After facing delays, Google unveiled its latest AI model, Gemini, this Wednesday. The timing couldn’t be better. A few weeks earlier, OpenAI went through a board coup that temporarily removed CEO Sam Altman. Google saw an opportunity in the uncertainty that unsettled its competitor, and the long-anticipated release positions the company to make a strategic move in the evolving AI landscape.

Google found salvation in its vast multimodal data repository from Search and YouTube. Gemini, trained like a curious infant, reshapes our notion of a large language model. Beyond consuming text, it can grasp the content of images and audio. This multimodal ability represents a more comprehensive form of “intelligence” and marks a significant step in the evolution of language models.

While the conventional method for building multimodal models involves training separate components for each modality and then stitching them together, Gemini took a different route. It was trained from the ground up on multiple modalities at once. Google describes this approach as “natively multimodal” to set Gemini apart from standard models.

Awe-Struck Responses To The Chatbot

The demo videos showcasing the model drew awed reactions, showing feats not seen in other AI models. Gemini recognized a dot-to-dot picture as a crab before it was completed, tracked a paper ball hidden beneath a plastic cup, and even exposed sleight-of-hand tricks. In a departure from the norm, Gemini was trained on Google’s in-house tensor processing units (TPUs) instead of the more common graphics processing units (GPUs). This choice matters amid the GPU shortages affecting AI model builders.

Gemini comes in three sizes for different platforms: Nano, for on-device tasks like text summarization; Pro, which powers the AI chatbot Bard; and Ultra, the most capable multimodal version, set for release next year after safety checks. Starting December 13, developers can access the model through Google Cloud’s API. Notably product-oriented, Gemini integrates into the Google ecosystem, which distinguishes it from other models on the market.

Closer scrutiny of Google’s claims tempered the initial excitement. Wharton professor Ethan Mollick demonstrated that ChatGPT could readily replicate tasks that had seemed impressive in the Gemini demo, such as step-by-step image analysis. University of Wisconsin-Madison associate professor Dimitris Papailiopoulos tested 14 multimodal reasoning examples from the Gemini research paper on GPT-4V, the vision-enabled version of GPT-4. GPT-4V handled 12 of them correctly, and a few of its responses surpassed Gemini’s.

Google admitted to editing the demo videos to reduce response time after Bloomberg’s inquiries revealed that voice dialogue had been inserted into the seemingly natural conversation with Gemini. In reality, the model was given text prompts and shown images in sequence. After the embarrassing demo mishap at Bard’s launch, Google was evidently keen to avoid a repeat. Nevertheless, marketing aside, Gemini has steered AI toward a broader scope than that of a mere conversational chatbot.

