The first episode of OpenAI Essentials: A Guide to Personal AI Applications

Introduction
Hello and welcome to the blog OpenAI Essentials: A Guide to Personal AI Applications - Episode 1. As the name suggests, this is the first episode of the series I am writing. In case you missed it, here is the link to the introduction of the series.
In this episode, we will take a simplified, theoretical look at artificial intelligence and machine learning. Throughout this part, we will learn how artificial intelligence works using real-life analogies. Then, we will explore machine learning and neural networks. On top of that, I will explain important terminology in the context of artificial intelligence, for example, what tokens, weights, and artificial neural networks are.
Next, we will have a brief introduction to OpenAI, as well as its model offerings. I will explain the difference between language and non-language models. Then, we will look more in detail at each model, and I will guide you through their pricing and real-life applications.
By the end of the episode, you should have an idea of how to choose the best model suited for your personal AI assistant, depending on your needs and price range.
What is artificial intelligence, what is machine learning, and how does it work?
It's important to distinguish between artificial intelligence (AI) and machine learning (ML), as these terms are commonly confused in discussions and popular media. That being said, to understand ML, you need at least a basic understanding of AI.
AI can be broadly referred to as a field in computer science that aims to create machines capable of performing tasks that humans can do - like reasoning, decision-making, pattern recognition, or understanding language. Usually, however, AI is referred to as a program or system that performs these tasks. In this article, I will adopt the same terminology.
While creating an AI might not sound too complicated at first, as I will explain, the process is far from straightforward. For instance, when my girlfriend asks me to GET UP and make coffee, as a human, I immediately understand the request within our relational and situational context. I know that making coffee involves going to the kitchen and physically preparing the drink, whilst considering her preferences. If we had a new coffee machine, I would probably be able to operate it based on intuition.
Unlike humans, AI processes such requests by analyzing patterns in the data it has been trained on, without an explicit understanding of the social or emotional context. AI cannot truly 'understand' or 'feel'; it simply executes tasks based on its programming and learned data. If it receives a request that bears no similarity to what it has been trained on, then instead of making coffee, it might just boil the water, without realizing it needs to add coffee too.
So, you might be asking: how is it even possible that computers can mimic human actions so convincingly that you cannot tell whether you are interacting with an AI or a human? Not only that, but AI can generate images from text, speak like a human, or even translate speech into another language in real time. It can also analyze data at speeds that a group of hundreds of analysts wouldn't be able to match.
Well, the unsettling part is that it works similarly to the human brain... In a way. The important word here is 'similarly'. Although our understanding of the brain's full capabilities is still incomplete, we know that it operates through a connected network of neurons that process and transmit information using tiny electrical impulses and chemical signals.
Building on this concept, the British mathematician and visionary Alan Turing asked: "May not machines carry out something which ought to be described as thinking but which is very different from what a man does?" Turing's theoretical framework provided a foundation that spurred further research into artificial intelligence, eventually leading to the modern AI we encounter daily. In the spirit of Turing's proposal, AI systems consist of connected nodes called artificial neurons that process information. As AI processes information, it is passed from one artificial neuron to another, mirroring the functioning of the neural network in the human brain. Within the context of AI, we refer to this network as an artificial neural network.
As information "travels" from one artificial neuron to another, each neuron assigns a weight to the information it processes, determining how strongly a particular feature should influence the final decision. These weights are adjusted as the model processes more and more information, improving the model's accuracy over time, a process known as machine learning.
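To make the idea of weights concrete, here is a minimal sketch of a single artificial neuron in Python; the numbers are invented purely for illustration:

```python
# A single artificial neuron: multiply each input by its weight, sum the
# results, and "fire" only if the total signal is strong enough.
# Real networks chain thousands or millions of these together.

def neuron(inputs: list[float], weights: list[float], bias: float) -> float:
    """Weighted sum of inputs followed by a simple step activation."""
    total = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 1.0 if total > 0 else 0.0

# Two input features; each weight says how strongly a feature matters.
print(neuron([0.9, 0.2], [0.7, -0.3], bias=-0.4))  # 1.0 -> the neuron fires
```

During machine learning, it is exactly these weight values that get adjusted, nudging each neuron toward better decisions.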
Machine learning, in theory, should be an automated process, but it's not possible without human intervention, at least not initially. So, how does one train an AI model? Let's consider a real-life scenario where the AI should differentiate between spam and non-spam emails. To enable AI to perform this task, we need to provide it with a set of initial data that has already been categorized, known as labeled data. That would involve a dataset of emails labeled as 'spam' and 'non-spam.'
Once we provide this data, the AI model will note some characteristic features of spam emails. For instance, it might notice that words like 'free' and 'win iPhone' frequently appear in spam emails, whereas they are uncommon in personal or business emails. It might still label some emails incorrectly; in such cases, human intervention is necessary to provide the AI with feedback and corrections.
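Here is a minimal sketch of that spam workflow, assuming the scikit-learn library is installed (pip install scikit-learn); the tiny labeled dataset is invented purely for illustration:

```python
# Train a toy spam classifier on labeled emails, then classify a new one.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "Win a free iPhone now",          # spam
    "Free prize, claim your win",     # spam
    "Meeting moved to 3 pm",          # non-spam
    "Lunch tomorrow with the team?",  # non-spam
]
labels = ["spam", "spam", "non-spam", "non-spam"]  # the labeled data

vectorizer = CountVectorizer()              # turns words into countable features
features = vectorizer.fit_transform(emails)

model = MultinomialNB()                     # a simple probabilistic classifier
model.fit(features, labels)                 # "training" adjusts the model's parameters

# Words like "free" and "win" now point toward spam.
print(model.predict(vectorizer.transform(["Claim your free iPhone"])))
# likely: ['spam']
```

Real spam filters work with far more data and features, but the principle of learning from labeled examples is the same.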
This explanation is oversimplified, using simple analogies. Consider it 'just enough' for you to understand how AI/ML works and to provide basic information for using OpenAI's models. Later on, however, we will explore how to train our own models, and this will include much more detailed elaboration, so if you are interested in this topic, stay tuned!
Introduction to OpenAI and Its Offerings
OpenAI was originally a non-profit company developing artificial intelligence. As the potential and scope of AI grew, OpenAI transitioned to a capped-profit organization. This shift limited the profit of investors to a maximum return of 100 times their investment.
In order for the company to continue developing AI with security and ethics in mind, the change of its business model was necessary. It eventually received $1 billion from multiple organizations, but the most significant contributor has been, and continues to be, Microsoft. As a result of this funding, OpenAI now offers various models, which can be grouped into two main categories based on accessibility:
1. Models accessible to the general public
These are models that users can interact with through a user-friendly interface, such as the OpenAI web platform, allowing everyone to use AI without technical expertise; a familiar example is ChatGPT. At the time of writing, OpenAI offers a free GPT-3.5 model and a more advanced, subscription-based GPT-4 model. These so-called general language models are designed to provide conversational responses, but they are also able to write some simple code. Here, users only have two models to choose from, although GPT-4 extends its capabilities to processing images and sound as well.
2. Models accessible through APIs
These advanced models offer greater flexibility and customization but require a more technical understanding. Through OpenAI APIs, developers and businesses can integrate AI capabilities directly into their applications, something I will demonstrate in later episodes of the blog. The API-accessible models include the more powerful GPT-4 Turbo, image-generation models like DALL·E, Whisper for audio-to-text, TTS for text-to-speech conversion, and other models tailored for specific tasks.
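To give you a taste of what such an integration looks like, here is a minimal sketch using the official openai Python package (pip install openai); it assumes an OPENAI_API_KEY environment variable is set:

```python
# Send one chat message to a language model and print the reply.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what a token is in one sentence."},
    ],
)
print(response.choices[0].message.content)
```

Each model discussed below is selected simply by changing the model parameter in calls like this one.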
Detailed Exploration of API-Accessible Models
API-accessible models are divided into two general categories: language models and non-language models.
Language models are primarily designed to generate, understand, and interact with natural human language. ChatGPT is an example of such a model.
Non-language models are primarily focused on processing other forms of data, for example, images or audio. Now, let's explore each model offered in both categories. Before we proceed, however, let's clarify some terminology used in this section:
Tokens are pieces of words or entire words that AI models process as units of text. For instance, the word "Jozef" counts as one token, while "it's superb blog" might be split into four tokens: "it", "'s", "superb", and "blog". It's hard to estimate exactly how many tokens are used for each interaction with AI, but due to the low cost per token, it's generally okay to work with broad estimates.
Context Window describes the maximum amount of text (in tokens) that the model can work with at a time. A larger context window generally means the model can work with more information, therefore providing more coherent and appropriate outputs.
Input and Output Costs refer to the pricing of models based on the number of tokens the model processes when receiving an input (input cost) and generating output (output cost).
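As a rough illustration of tokens and costs, here is a minimal sketch using OpenAI's tiktoken package (pip install tiktoken); the rates used are the GPT-3.5 Turbo prices listed later in this episode:

```python
# Count the tokens in a prompt and estimate what the input would cost.
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
prompt = "it's superb blog"
num_tokens = len(encoding.encode(prompt))
print(num_tokens)  # roughly four tokens, as in the example above

# GPT-3.5 Turbo: $0.50 per 1M input tokens (see its pricing table below)
input_cost = num_tokens / 1_000_000 * 0.50
print(f"Estimated input cost: ${input_cost:.8f}")  # a tiny fraction of a cent
```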
The following part about pricing is accurate as of April 25, 2024. Although I will update the blog periodically, for the most up-to-date pricing details, please refer to the OpenAI pricing documentation. The general description and explanation of each model, however, remains up-to-date.
Language models:
GPT-4 Series:
GPT-4 is the most powerful series, able to process text and images as input and generate text as output. It can solve complex problems thanks to its broader knowledge and advanced reasoning. The model could therefore be useful for fact-checking an article or analyzing large chunks of text.
GPT-4 Turbo
This latest model includes vision capabilities, allowing it to identify objects in an image.
Feature | Details |
---|---|
Context Window | 128,000 tokens |
Training data cutoff | December 2023
Input price per 1M tokens | $10.00 |
Output price per 1M tokens | $30.00 |
GPT-4
An older but still powerful model, primarily focused on text inputs.
Feature | Details |
---|---|
Context Window | 8,192 tokens |
Training data cutoff | September 2021
Input price per 1M tokens | $30.00 |
Output price per 1M tokens | $60.00
It's clear that OpenAI still offers its older models. However, unless you need the specific capabilities of an older model, you should opt for the latest stable one. The newest GPT-4 Turbo is not only the most advanced GPT but also the cheapest in the GPT-4 family, which is why it's usually my preferred model.
By the way, did you notice the difference between the context windows of GPT-4 and GPT-4 Turbo? It's incredible how fast the AI field moves forward. We can assume that GPT-5, or at least GPT-4.5, is on its way. And the most impressive aspect might be what is yet to come: combining AI with quantum computing!
GPT-3.5 Series:
GPT-3 was a breakthrough in AI text generation. As the largest model available back in 2020, it was an unprecedented step forward in AI systems. GPT-3 then evolved into GPT-3.5. Today, it's slightly behind its big brother GPT-4 in terms of power, reasoning, and advancements, but it remains more cost-efficient. Don't get me wrong, though: GPT-3.5 is still a very powerful model!
A great example of its application is spell-checking an article, where a deep understanding of the text is not as crucial.
GPT-3.5 Turbo
The latest in the GPT-3.5 Turbo lineup.
Feature | Details |
---|---|
Context Window | 16,385 tokens |
Training data cutoff | September 2021
Input price per 1M tokens | $0.50 |
Output price per 1M tokens | $1.50 |
GPT-3.5 Turbo-Instruct
A specialized version within the GPT-3.5 Turbo family, designed to follow instructions more precisely. Best suited for scenarios where specific and direct responses are needed.
Feature | Details |
---|---|
Context Window | 4,096 tokens
Training data cutoff | September 2021
Input price per 1M tokens | $1.50 |
Output price per 1M tokens | $2.00 |
Fine-Tuning Models:
Fine-tuning allows you to customize OpenAI's base models according to your or your organization's needs. While customizing a model provides great flexibility, it requires additional investment: not only money but also time and resources.
Usually, custom models serve specific scenarios. One example would be training a sarcastic model: you would need to provide it with example questions paired with example sarcastic answers. The number of examples required may vary; generally, for models like GPT-3.5 Turbo, a starting point would be 50 to 100 training examples. After that, you could see an actual improvement in responses.
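As a minimal sketch of what that training could look like, assuming the openai package; the file name and example contents are hypothetical, and each training example is one JSON line pairing a question with a sample sarcastic answer:

```python
# Upload labeled examples and start a fine-tuning job on GPT-3.5 Turbo.
# Each line of sarcastic_examples.jsonl looks like:
# {"messages": [{"role": "user", "content": "Is water wet?"},
#               {"role": "assistant", "content": "No, it's famously dry."}]}
from openai import OpenAI

client = OpenAI()

training_file = client.files.create(
    file=open("sarcastic_examples.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id)  # poll this job; once it finishes, you get a custom model name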
Models trained on larger datasets are generally more challenging to fine-tune for specific tasks, as they have been optimized for the opposite: broad applications. On the other hand, models trained on smaller datasets might be easier to adapt to niche tasks, as they are less pre-conditioned and more adaptable.
Choosing the right model to fine-tune is therefore a balancing act. Apart from pricing, it's crucial to weigh a model's existing capabilities against the level of customization you need.
Let's explore the three currently offered models that can be fine-tuned. Before we proceed, though, let me make something clear: the earlier example of a sarcastic chatbot is just for illustration. You can already generate sarcastic responses with existing models, simply by using the correct prompts.
GPT-3.5 Turbo
Ideal for applications requiring conversational capabilities, like chatbots. In this scenario, fine-tuning GPT-3.5 is highly effective. While GPT-3.5 may not handle complex or comprehensive language tasks as effectively as Davinci (a model trained on a larger dataset), it can provide superb conversational responses.
Cost Type | Amount |
---|---|
Training Cost per 1M tokens | $8.00 |
Input price per 1M tokens | $3.00 |
Output price per 1M tokens | $6.00 |
Davinci-002
Ideal for content generation, translations, or complex language tasks. The Davinci model has been trained on a broader and more varied dataset than GPT-3.5 Turbo.
Cost Type | Amount |
---|---|
Training Cost per 1M tokens | $6.00 |
Input price per 1M tokens | $12.00 |
Output price per 1M tokens | $12.00 |
Babbage-002
A cost-effective option, but it's been trained on a smaller dataset. From my experience, it requires extensive training and precise prompts. What's more, it's expected to be deprecated in the near future.
Cost Type | Amount |
---|---|
Training Cost per 1M tokens | $0.40 |
Input price per 1M tokens | $1.60 |
Output price per 1M tokens | $1.60 |
Embedding Models
These models are useful when you need to understand and categorize large volumes of text data. They can find semantically similar words, for instance, "joy" and "happiness" could be categorized as positive emotions.
Furthermore, embedding models can cluster information with similar meanings, like hockey articles being clustered with football articles, since both are sports. (Yes, you read that correctly: football, the European way.) There are many more use cases; to name just a few: e-commerce recommendations based on customers' previous choices, anomaly detection to spot mistakes in data, detecting fraudulent emails, or labelling emails as urgent and non-urgent.
When an embedding model processes a piece of text, it transforms the information from text into a vector of floating-point numbers. You can imagine this as a numerical representation of real-world objects. For example, humans know that the Sun and the Moon are somehow analogous objects, but computers have no idea. Embedding models therefore assign numerical values to both the Sun and the Moon so that their relationship can be measured.
Consequently, the greater the distance between two vectors, the less related the objects are. For example, the vector distance between the Sun and the Moon might be bigger than between a bee and a wasp. Embedding models essentially create large volumes of these numerical vectors.
Now, let's return to terminology for a bit, because you will soon notice a new term: output dimension. Explained in simple terms, an output dimension of 300 means that every text processed by the embedding model is converted into a vector of 300 numbers, each representing a different aspect of the text.
That would mean the bigger the output dimension, the better, huh? Well, not always, as the price for a bigger output dimension is usually latency. So, if you need real-time data analysis, you would choose a low-dimensional model, whereas if you need high accuracy, a high-dimensional model would be a better fit. Now, let's have a look at the models on offer.
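Before we do, here is a minimal sketch of embeddings in action, assuming the openai package; cosine similarity is computed by hand here, and values closer to 1.0 mean the texts are more related:

```python
# Embed three short texts and compare how related they are.
import math
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return response.data[0].embedding  # a vector of 1,536 floating-point numbers

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

sun, moon, bee = embed("the Sun"), embed("the Moon"), embed("a bee")
print(cosine_similarity(sun, moon))  # likely higher: both are celestial objects
print(cosine_similarity(sun, bee))   # likely lower: unrelated concepts
```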
Text-Embedding-3-Small
Ideal for applications requiring fast, cost-effective text analysis. Used in customer support systems to quickly categorize and route customer inquiries based on their content.
Feature | Details |
---|---|
Output Dimension | 1,536 |
Usage Cost | $0.02 per 1M tokens |
Text-Embedding-3-Large
Provides more detailed embeddings with higher capacity. Used in legal, healthcare, or educational data analysis to identify and extract relevant information from greater volumes of data.
Feature | Details |
---|---|
Output Dimension | 3,072 |
Usage Cost | $0.13 per 1M tokens |
Ada V2
A second-generation model that combines speed and accuracy, with solid contextual understanding, although it predates the text-embedding-3 family. Used in e-commerce for systems that suggest products based on textual similarity in product descriptions and customer reviews.
Feature | Details |
---|---|
Output Dimension | 1,536 |
Usage Cost | $0.10 per 1M tokens |
Assistants API:
The Assistants API allows developers to integrate AI assistants directly into their applications. To clarify further, this API essentially enables you to create THE chatbot of chatbots. You can choose the underlying language model and leverage advanced customization tools like sentiment analysis, content filtering, and summarization. The assistant can serve a variety of roles, including tutor, e-commerce advisor, content generator, interactive game entertainer, or healthcare adviser for basic inquiries. On top of that, OpenAI currently offers powerful tools to extend an assistant's capabilities, shown in the sketch after this list:
Code Interpreter
Allows the AI assistant to write and execute code in real time in a sandboxed environment. Incredibly useful for any tech environment, this tool provides real-time coding assistance that enhances code quality. It can run code and show its outcomes, allowing developers to see potential results without executing anything in their own environment, thereby streamlining the development process.
Pricing |
---|
$0.03 per session. Additionally, you pay for the tokens used. |
File Search
For efficient searching through large datasets and file management systems. Whether retrieving customer info or accessing historical data, file search can locate information in your datasets based on instructions. Combined with other tools, it can then automatically act on the data it finds.
Pricing |
---|
Costs $0.10 per GB of vector storage per day, with the first 1 GB free. |
Function Calling
Enables the assistant to call specific functions in your application based on user input, queries, or predefined triggers. For instance, it could reset a customer's password based on their request. This tool is priceless, literally: OpenAI does not charge for using it.
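As promised, here is a minimal sketch that wires all three tools into one assistant, assuming the openai package; the assistant's name, instructions, and the reset_password function are hypothetical, and tool type names follow the API version current at the time of writing:

```python
# Create an assistant equipped with code interpreter, file search,
# and one custom function the model may ask our application to run.
from openai import OpenAI

client = OpenAI()

assistant = client.beta.assistants.create(
    name="Support Assistant",
    instructions="Help customers with account and billing questions.",
    model="gpt-4-turbo",
    tools=[
        {"type": "code_interpreter"},  # run code in OpenAI's sandbox
        {"type": "file_search"},       # search documents we upload
        {
            "type": "function",        # the model decides when to call this...
            "function": {
                "name": "reset_password",  # ...but our application executes it
                "description": "Reset a customer's password.",
                "parameters": {
                    "type": "object",
                    "properties": {"email": {"type": "string"}},
                    "required": ["email"],
                },
            },
        },
    ],
)
print(assistant.id)  # conversations then run in threads attached to this assistant
```

Note that for function calling, the model only produces the function name and arguments; executing reset_password and returning its result stays entirely in your application.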
Base Models
Base models are early models offered by OpenAI. Generally speaking, these are already considered "legacy" models, so instead of using them, OpenAI recommends GPT-3.5 or GPT-4. Base models are still able to understand natural language, but they are unable to follow instructions to the extent other models can. On the other hand, they are cheap...
Model | Usage Cost |
---|---|
Davinci-002 | $2.00 per 1 million tokens |
Babbage-002 | $0.40 per 1 million tokens |
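For completeness, here is a minimal sketch of calling a base model, assuming the openai package; note the older prompt-and-completion style rather than chat messages:

```python
# Ask a base model to continue a plain text prompt.
from openai import OpenAI

client = OpenAI()

response = client.completions.create(
    model="babbage-002",
    prompt="Once upon a time, in a server room far away,",
    max_tokens=25,
)
print(response.choices[0].text)  # the raw continuation of the prompt
```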
Non-language models:
As described earlier, OpenAI offers non-language models in addition to language models. These can be categorized as models that do not work primarily with text but with other forms of data, such as images and audio. Using APIs, you can integrate these models into applications for image generation or sophisticated audio processing.
DALL·E Image Models:
OpenAI's DALL·E models can generate images from text. I believe this does not require a more detailed explanation, as their usage is obvious. It is important, though, to address the limitations of these models: from my experience, they are still far from perfect when you need to generate a specific image.
For example, they are unable to generate electrical schematics, or at least I wasn't able to find that "sweet, sweet prompt" to do so. On the other hand, if you are a fan of human-like cats, or if you need to generate a logo, you are in luck.
Speaking from experience, I found it useful when I needed to generate images for my web-based board game in a particular color scheme and temperature. I was also able to generate realistic images of dogs, cats, and horses, and if you are a madman and decide to combine these three animals, you can create a cat-horse-dog mutant. Finally, the thumbnails for my blog posts are generated by DALL·E, so you can see a real-life example.
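If you want to try it yourself, here is a minimal sketch of generating one image with DALL·E 3, assuming the openai package; the prompt is just an example:

```python
# Generate a single image from a text prompt and print its temporary URL.
from openai import OpenAI

client = OpenAI()

response = client.images.generate(
    model="dall-e-3",
    prompt="A warm, orange-toned thumbnail of a friendly robot writing a blog",
    size="1024x1024",
    quality="standard",  # "hd" doubles the price, per the table below
    n=1,                 # DALL·E 3 generates one image per request
)
print(response.data[0].url)  # the link expires, so download the image promptly
```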
You also have the option to try DALL·E for free using Microsoft's AI chat: as the biggest investor in the company, Microsoft officially provides some of the OpenAI functionalities, including the DALL·E 3 model. For some reason, though, I found it much less efficient in following prompts than OpenAI's own DALL·E. Here is the detailed pricing of each model:
Model | Resolution | Quality | Price Per Image
---|---|---|---
DALL·E 3 | 1024×1024 | Standard | $0.040
DALL·E 3 | 1024×1024 | HD | $0.080
DALL·E 2 | 1024×1024 | – | $0.020
DALL·E 2 | 512×512 | – | $0.018
DALL·E 2 | 256×256 | – | $0.016
Audio models:
Audio models offer text-to-speech and speech-to-text conversion. Their speech, in fact, sounds just like natural human speech rather than a robotic voice. OpenAI offers APIs so developers can implement these features in their applications.
I can imagine big implications in the future for people with sight impairments, or conversely, for people with hearing problems, as these systems work both ways. Now, let's have a look at the three models currently on offer:
Whisper:
Whisper is a speech recognition model capable of understanding and transcribing speech in multiple languages. It can also translate spoken language into English and identify the source language.
It's important to note that, as of the time of writing, there is no difference between OpenAI's open-source Whisper and the API version, apart from speed. Therefore, you should only opt for the API version if you need super-fast speech recognition.
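Here is a minimal sketch of transcription through the API, assuming the openai package and a local audio file named recording.mp3:

```python
# Transcribe a local audio recording into text with Whisper.
from openai import OpenAI

client = OpenAI()

with open("recording.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )
print(transcript.text)  # the recognized speech as plain text
```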
Model | Usage Cost |
---|---|
Whisper | $0.006 per minute |
Text-to-speech (TTS)
TTS models transform written text into natural-sounding audio. That's it; there's not really much more to it, apart from the two models offered by OpenAI:
TTS-1
Optimized for real-time text-to-speech applications.
TTS-1 HD
Delivers high-definition audio quality.
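Below is a minimal sketch of converting text to speech with TTS-1, assuming the openai package; "alloy" is one of the built-in voices:

```python
# Turn a sentence into spoken audio and save it as an MP3 file.
from openai import OpenAI

client = OpenAI()

response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Hello from the blog! No human ever actually said this sentence.",
)
response.stream_to_file("hello.mp3")  # write the generated audio to disk
```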
Here is a simple comparison of these models' pricing:
Model | Usage Cost |
---|---|
TTS-1 | $15.00 per 1 million characters |
TTS-1 HD | $30.00 per 1 million characters |
Conclusions and what to expect next
So, here we come to the end of the first episode. By now, you should have theoretical foundations in artificial intelligence and machine learning. Hopefully, you also have an idea of what OpenAI is and the models it offers.
In the next blog, we'll develop a simple program that you can run locally: our own personal ChatGPT for a fraction of the price, at around 3-5 cents per interaction.
Now, if you made it this far and you are interested in the next episode, make sure to add me on LinkedIn, where I will post whenever a new episode is released, or follow me on Instagram, where, apart from upcoming episodes, you can follow my personal life.
I Want to Hear from You!
Your thoughts and feedback are valuable to me. Feel free to share your ideas, questions, or topics you’re curious about using contact information on hanektech.com. Meanwhile, stay cool, just like me. :)