GPT-4o (“o” for “omni”) from OpenAI, the Gemini family of models from Google, and the Claude family of models from Anthropic are the state-of-the-art large language models (LLMs) currently available in the Generative Artificial Intelligence space. GPT-4o was recently released by OpenAI, while Google announced the Gemini 1.5 models in early February 2024.
GPT-4o is natively multimodal: it accepts any combination of text, audio, image, or video inputs and produces outputs in text, audio, and image forms. Compared to its predecessor, GPT-4-turbo, it offers roughly twice the processing speed at 50% lower cost, making it suitable for practical, production-grade AI development services.
Meanwhile, Gemini currently offers four model variants:
- Gemini 1.5 Pro - Optimized for complex reasoning tasks like code generation, problem-solving, data extraction, and generation.
- Gemini 1.5 Flash - Fast and versatile performance across a diverse variety of tasks.
- Gemini 1.0 Pro - Supports common natural language tasks, multi-turn text and code chat, and code generation.
- Gemini 1.0 Pro Vision - Curated for visual-related tasks, like generating image descriptions or identifying objects in images.
At an AI services and custom software development company like Techjays, we work with these tools daily, and even the nitty-gritty details matter in our processes.
Benchmarks:
Common benchmarks used to evaluate large language models (LLMs) assess a wide range of capabilities, including multitasking language understanding, answering graduate-level technical questions, mathematical reasoning, code generation, multilingual performance, and arithmetic problem-solving. In most of these benchmarks, OpenAI's GPT-4o has demonstrated superior performance compared to the Gemini model variants from Google, solidifying its position as the overall best model in terms of output quality.
LLMs that take large input contexts can pose problems for AI development services because the models may forget specific pieces of information while answering. This can significantly degrade performance on tasks like multi-document question answering or retrieving information located in the middle of long contexts. A newly designed benchmark titled "Needle in a Needlestack" addresses this problem by measuring how well LLMs pay attention to information appearing at different positions in their context window.
[Image: Comparison of information retrieval performance between GPT-4-turbo, GPT-4o, and Gemini-1.5-pro relative to the token position of the input content. Source: Needlestack]
GPT-4-turbo's performance degrades significantly when the relevant information sits in the middle of the input context. GPT-4o does much better on this metric, allowing for longer input contexts. However, GPT-4o fails to match the overall consistency of Gemini-1.5-pro, which makes Gemini-1.5-pro the ideal choice for tasks requiring larger inputs.
API Access:
Both GPT-4o and the Gemini model variants are available through API access and require an API key to use the models.
OpenAI provides official client SDKs in Python and Node.js. Besides the official libraries, there are community-maintained libraries for all the popular languages like C#/.NET, C++, Java, and so on. One could also make direct HTTP requests for model access. Refer to the OpenAI API documentation for more information.
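As a quick illustration, here is a minimal sketch of calling GPT-4o through the official Python SDK; the prompt is a placeholder, and the API key is assumed to be set in the OPENAI_API_KEY environment variable:

```python
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

# Minimal chat completion request against GPT-4o
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain multimodal models in one sentence."},
    ],
)
print(response.choices[0].message.content)
```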
Google provides Gemini access through Google AI Studio and API access with client SDK libraries in popular languages like Python, JavaScript, Go, Dart, and Swift. Refer to the official Gemini documentation for further information.
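For comparison, a minimal sketch with the google-generativeai Python SDK; the prompt is a placeholder and the key would come from Google AI Studio:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # key obtained from Google AI Studio

# Minimal text generation request against Gemini 1.5 Pro
model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content("Explain long context windows in one sentence.")
print(response.text)
```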
In-Depth Model Specifications:
Gemini models with a 1 million token context window are billed at double the rate for inputs with context lengths greater than 128k tokens.
Source: OpenAI pricing
Feature Comparison:
- Context Caching: Google offers a context caching feature for the Gemini 1.5 Pro variant to reduce cost when consecutive API calls repeat content with high input token counts. This feature is well suited when we need to provide common context, like extensive system instructions for a chatbot, that applies across many consecutive API requests (see the caching sketch after this list). OpenAI, as of now, does not support this feature for GPT-4o or the other GPT model variants.
- Batch API: This feature is useful in scenarios where we have to process a group of inputs, like running test cases with an LLM, and we don't require an immediate response. OpenAI currently offers a Batch API to send asynchronous groups of requests with 50% lower costs, higher rate limits, and a 24-hour window within which the results are returned. This is particularly useful for saving costs in the development phase of Gen AI applications, which involves rigorous testing, and in scenarios where an immediate response is not required (a batch submission sketch also follows this list). Google does not offer Gemini under an equivalent Batch API, but batch predictions are available as a Beta feature in Google Cloud Vertex AI to process multiple inputs simultaneously.
- Speed/Throughput Comparison: The speed of an LLM is quantified in tokens per second received while the model is generating output. Gemini 1.5 Flash is reported to be the best of all popular LLMs in terms of tokens per second. GPT-4o is nearly 2 times faster than its predecessor GPT-4-turbo in inference speed, but it still falls significantly behind Gemini 1.5 Flash. However, GPT-4o is still faster than the more advanced Gemini variant, Gemini 1.5 Pro. Gemini's 1M token context window also allows for longer inputs, which will impact speed.
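To make the caching idea concrete, below is a minimal sketch using the caching module of the google-generativeai SDK. The model version string, TTL, and long_system_prompt.txt file are assumptions for illustration, and cached content must also meet a minimum token count to be accepted:

```python
import datetime

import google.generativeai as genai
from google.generativeai import caching

genai.configure(api_key="YOUR_API_KEY")

# Cache a large shared context once; the file name is a placeholder
cache = caching.CachedContent.create(
    model="models/gemini-1.5-pro-001",
    system_instruction=open("long_system_prompt.txt").read(),
    ttl=datetime.timedelta(hours=1),
)

# Subsequent requests reuse the cached tokens at a reduced input cost
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
response = model.generate_content("First user question against the cached context.")
print(response.text)
```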
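And here is a minimal sketch of OpenAI's Batch API flow: requests are written one per line to a .jsonl file, uploaded, and submitted as a batch with a 24-hour completion window (the requests.jsonl file name is a placeholder):

```python
from openai import OpenAI

client = OpenAI()

# Upload a .jsonl file where each line is one chat completion request
batch_file = client.files.create(
    file=open("requests.jsonl", "rb"),
    purpose="batch",
)

# Submit the batch; results arrive within the 24-hour completion window
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)
```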
Nature of Responses from GPT-4o and Gemini:
- Gemini has been recognized for making responses sound more human compared to GPT-4o. This, along with its ability to create draft response versions in the Gemini app, makes it suitable for creative writing tasks such as marketing content, sales pitches, essays, articles, and stories.
- GPT-4o responses are a bit more monotone, but its consistency on analytical questions has proven better, making it ideal for deterministic tasks such as code generation, problem-solving, and so on.
- Furthermore, Google has recently faced public backlash regarding the restrictiveness of Gemini's responses. A recent thread on Hacker News raised concerns that Gemini was refusing to answer questions about the C++ language because it deemed the topic unsafe for users under 18. Google faced another incident with Gemini's image generation, where Gemini produced historically inaccurate images when prompted for historical depictions of certain groups. Google temporarily paused the feature after issuing a statement acknowledging the inaccuracies.
- Both GPT-4o and Gemini have sufficient safeguards against malicious actors trying to elicit extreme content. However, this has raised concerns about the models being too restrictive and inherently biased towards certain political factions, declining to respond to one group on the political spectrum while answering freely for others.
- OpenAI faced allegations that GPT-4 had become "lazy" shortly after the introduction of GPT-4-turbo back in November 2023. The accusations mostly centered on GPT-4's failure to follow instructions completely. This laziness is believed to be mainly attributable to GPT forgetting instructions placed in the middle of the prompt. With GPT-4o exhibiting better performance on the Needle in a Needlestack benchmark, GPT-4o is now better at following all the instructions.
- Based on the nature and quality of answers produced by GPT-4o and Gemini, below are our opinionated preferences between GPT-4o and Gemini for various use cases.
RAG vs Gemini’s 1M Long Context Window:
Retrieval Augmented Generation, or RAG for short, is the process through which we provide relevant external knowledge as context to answer a user's question. This technique is effective when the LLM's inherent knowledge is insufficient to provide an accurate answer. RAG is crucial for building custom LLM-based chatbots over domain-specific knowledge bases such as internal company documents, brochures, and so on. It also improves answer accuracy and reduces the likelihood of hallucinations. For example, take an LLM-based chatbot that answers from internal company documents. Given the limited context window of LLMs, it is difficult to pass entire documents as context. The RAG pipeline lets us filter out the document chunks relevant to the user's question using NLP techniques and pass those as context.
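A minimal sketch of that pipeline, using embedding-based retrieval with the OpenAI SDK; the chunks, model choices, and question are illustrative placeholders, not a production setup:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

# Hypothetical document chunks from an internal knowledge base
chunks = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday through Friday, 9am to 5pm IST.",
    "Enterprise plans include a dedicated account manager.",
]

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

chunk_vectors = embed(chunks)

def answer(question, top_k=2):
    q_vec = embed([question])[0]
    # Cosine similarity ranks chunks by relevance to the question
    scores = chunk_vectors @ q_vec / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q_vec)
    )
    context = "\n".join(chunks[i] for i in np.argsort(scores)[::-1][:top_k])
    # Only the retrieved chunks are passed to the model as context
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

print(answer("How long do customers have to return a product?"))
```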
The 1M context window of Gemini allows for the possibility of passing large documents as context without the use of RAG. Moreover, this approach could provide better performance when RAG's retrieval performance is poor for a given set of documents. There's also an expectation that as LLM capabilities improve over time, context windows and latency will improve proportionally, negating the need for RAG.
While the longer context window makes a compelling case against RAG, it comes with a significant increase in cost per request and is wasteful in terms of compute. Increased latency and performance degradation due to context pollution also make this approach challenging to adopt. Despite the expectation of context windows growing over time, and despite the fallible nature of the NLP techniques RAG employs, RAG remains the optimal, scalable approach for a large corpus of external knowledge.
Rate Limits:
Given the compute-heavy nature of LLM inference, rate limits are in place on both Gemini and GPT-4o. Rate limits are intended to prevent misuse by malicious actors and to ensure uninterrupted service for all active users (a client-side retry sketch follows the list below).
- OpenAI follows a tier-based rate limit approach. The free tier sets rate limits for GPT-3.5-turbo and the text embedding models. Five paid tiers sit above the free tier, from Tier 1 to Tier 5. Users are bumped to higher tiers with better rate limits as their API usage increases, so Tier 5 users get the best rate limits to accommodate their high usage needs. Refer to the usage tiers documentation from OpenAI for detailed information on tier limits. Below are the rate limits for GPT-4o.
- Google, on the other hand, provides Gemini in two modes: Free of Charge and Pay-as-you-go. Refer to the pricing page for up-to-date information on rate limits. Below are the detailed rate limits for the Gemini model variants.
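Regardless of tier, production clients should expect occasional rate-limit (429) responses. Here is a minimal retry sketch with exponential backoff using the OpenAI SDK; the retry count and delays are arbitrary choices for illustration:

```python
import random
import time

from openai import OpenAI, RateLimitError

client = OpenAI()

def chat_with_backoff(messages, max_retries=5):
    """Call GPT-4o, retrying on rate-limit errors with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(model="gpt-4o", messages=messages)
        except RateLimitError:
            # Wait 1s, 2s, 4s, ... plus random jitter before retrying
            time.sleep(2 ** attempt + random.random())
    raise RuntimeError("Exceeded retry budget for rate-limited requests")
```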
Conclusion:
Overall, GPT-4o provides the strongest, most consistent, and most reliable capabilities in answering questions, making it a strong default for AI development services. Gemini, meanwhile, brings a variety of broad features that fit well into AI development services, such as longer context windows, context caching, and mini-model variants faster than similar offerings like GPT-3.5-turbo from OpenAI. Last but not least, Gemini provides a rather generous free tier for API access, though OpenAI has made GPT-4o free for all tiers of users on ChatGPT.
For those looking to invest in AI, the choice between GPT-4o and Gemini ultimately comes down to the problem requirements and a cost-benefit analysis in your AI services journey. For projects with heavy requirements for analysis, mathematical reasoning, and code generation, GPT-4o seems to be the best option, with Gemini 1.5 Pro falling close behind. For AI services tasks that require a good level of creativity, like story writing, Gemini model variants seem to have inherent qualities that make them well-suited for such creative endeavors. Some tasks, such as document question answering and processes involving a high number of steps, require longer context windows; for these, Gemini emerges as the most suitable choice, offering an impressive 1M input context limit and information retrieval capabilities that surpass those of GPT-4o.