From Hype to Reality: Open-Source Giants Challenge Tech Titans with Free Models, Sparking a New Era of Innovation!

The Journey
4 min read · Mar 21, 2024
Graphics Credits: The WSJ

AI, AI, AI! Everyone seems to be chasing after it. But let’s not mistake the hype for substance: it isn’t driven by technology alone; money is a big part of it. Only a few companies genuinely want to build AI solutions; the rest are just talking big to squeeze more money out of investors.

In the world of AI, big players like OpenAI and Microsoft have dominated the market for a while. But now, a new wave of companies is shaking things up by offering their AI models for free.

These companies believe that by making their technology open source, they can challenge the dominance of big tech firms.

Open source means freely sharing technology for anyone to use, modify, and redistribute. This approach has already revolutionized the internet and cloud computing.

Now, companies like Mistral AI and Hugging Face are betting they can do the same for AI.

OpenAI has been leading the pack, controlling a huge chunk of the AI market. But other companies are stepping up. Elon Musk’s startup, xAI, has open-sourced its chatbot Grok. Meta Platforms released its Llama 2 model for free, intensifying the competition. Google also joined in, sharing its Gemma models.

Plenty of startups are joining the open-source movement, too. Mistral AI, Hugging Face, Runway ML, and others are making their AI models available to everyone.

This is appealing to businesses because they can use these models without paying hefty fees to big tech companies.

But there are challenges. Training AI models can cost millions of dollars, and companies need to find ways to cover these costs. They also need to figure out how to make money from open-source AI. Some are offering paid services on top of their free models; Databricks, for example, helps businesses build custom models.

Despite the hurdles, venture capital funding for open-source AI startups is skyrocketing. Investors see the potential in these companies to innovate rapidly with the help of independent developers.

Open-source AI also creates ecosystems where other startups can thrive.

Together AI, for example, raised over $100 million by selling tools to businesses using open-source models.

However, there are still challenges ahead. AI models require more capital than traditional software, and licensing can be tricky. Some companies restrict certain uses of their open-source AI, which limits its usefulness.

Despite these challenges, the future looks bright for open-source AI. It’s shaking up the industry and giving power back to the people.

Let’s use the Hugging Face Transformers library, which provides a wide range of pre-trained models for natural language processing tasks. We’ll use a pre-trained model for text classification on the IMDb movie review dataset.
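Before fine-tuning anything, it helps to see how little code a pre-trained model needs out of the box. Here’s a minimal sketch using the Transformers pipeline API; the sample review and the printed score are illustrative:

from transformers import pipeline

# Load a default pre-trained sentiment model from the Hugging Face Hub
classifier = pipeline("sentiment-analysis")

# Classify a review with no training of our own
print(classifier("This movie was an absolute delight!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.9998}]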

First, install the Hugging Face Transformers library, along with the Datasets library and TensorFlow, which the script below also uses:

pip install transformers datasets tensorflow

Now, let’s write a Python script to load a pre-trained model and fine-tune it on the IMDb dataset for sentiment analysis:

import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification, AutoTokenizer
from datasets import load_dataset

# Load IMDb dataset
dataset = load_dataset("imdb")

# Load pre-trained model and tokenizer
model_name = "distilbert-base-uncased" # Choose any pre-trained model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = TFAutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tokenize and preprocess the dataset
def tokenize_data(example):
    return tokenizer(example["text"], truncation=True, padding="max_length")

tokenized_dataset = dataset.map(tokenize_data, batched=True)

# Prepare training and testing datasets
# (shuffle before selecting a subset: the raw IMDb splits are sorted by label)
train_dataset = tokenized_dataset["train"].shuffle(seed=42).select(range(10000))  # Subset for demonstration
test_dataset = tokenized_dataset["test"].shuffle(seed=42).select(range(2000))  # Subset for demonstration

# Define training parameters
batch_size = 8
epochs = 3

# Convert to tf.data.Dataset objects the model can consume
tf_train = train_dataset.to_tf_dataset(
    columns=["input_ids", "attention_mask"],
    label_cols="label",
    shuffle=True,
    batch_size=batch_size,
)
tf_test = test_dataset.to_tf_dataset(
    columns=["input_ids", "attention_mask"],
    label_cols="label",
    shuffle=False,
    batch_size=batch_size,
)

# Train the model
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])
model.fit(tf_train, epochs=epochs)

# Evaluate the model
eval_result = model.evaluate(tf_test)
print("\nTest accuracy:", eval_result[1])

In this script:

  1. We load the IMDb dataset using the load_dataset function from the Hugging Face Datasets library.
  2. We choose a pre-trained model (in this case, DistilBERT) and load both the model and its corresponding tokenizer.
  3. We tokenize and preprocess the dataset using the tokenizer.
  4. We take shuffled subsets of the training and testing splits and convert them to tf.data.Dataset objects.
  5. We define training parameters, such as batch size and number of epochs.
  6. We compile and train the model using TensorFlow.
  7. Finally, we evaluate the model’s performance on the test dataset.

This example demonstrates how to leverage open-source pre-trained models and libraries for AI tasks like text classification.
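Once training finishes, you’d typically save the fine-tuned weights and reuse them for inference. Here’s a minimal sketch of that last step, assuming the script above has already run; the directory name and the sample review are just placeholders:

# Save the fine-tuned model and tokenizer (directory name is arbitrary)
model.save_pretrained("imdb-distilbert")
tokenizer.save_pretrained("imdb-distilbert")

# Later, reload them and classify a new review
reloaded_model = TFAutoModelForSequenceClassification.from_pretrained("imdb-distilbert")
reloaded_tokenizer = AutoTokenizer.from_pretrained("imdb-distilbert")

inputs = reloaded_tokenizer("A tense, beautifully shot thriller.", return_tensors="tf", truncation=True)
probs = tf.nn.softmax(reloaded_model(**inputs).logits, axis=-1)
print("Negative/positive probabilities:", probs.numpy())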

Follow for more on AI! The Journey — AI by Jasmin Bharadiya
