Finding or building the right dataset is one of the most important tasks in any machine-learning project. A model that isn’t built on solid foundations is unlikely to work as intended.
Kaggle is a well-known site where you can download thousands of suitable datasets. However, other providers are becoming more popular, and one of them, Hugging Face, is the subject of this article.
Hugging Face is an open-source platform best known for its natural language processing (NLP) datasets. But what is an NLP dataset, and what is it used for?
NLP is a branch of artificial intelligence concerned with interaction between humans and computers through natural language. It involves processing large amounts of human language (usually text) to find hidden patterns and insights.
NLP offers many real-life benefits, including the ability to categorize text, detect hate speech, and filter out spam e-mails and messages.
We’ll take a deeper look at Hugging Face datasets: what data they contain, how they are organized, and what tasks they can serve.
Top 10 Hugging Face Datasets
1. IMDB Dataset
The IMDB dataset contains 50,000 highly polar movie reviews, each labeled positive or negative according to the sentiment of the written comment.
The labeled data is split into two equal parts, 25,000 reviews for training and 25,000 for testing, and an additional split of unlabeled reviews is available if needed. A model trained on this dataset can detect positive or negative movie feedback in other texts, and can also surface which features of a movie were particularly liked or disliked.
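To make the binary sentiment task concrete, here is a toy keyword-count scorer over IMDB-style records. The cue-word lists and the `predict_sentiment` helper are invented for this sketch; a real model would instead be trained on the labeled reviews.

```python
# Toy sentiment scorer over IMDB-style records ({"text": ..., "label": 0 or 1}).
# The word lists are illustrative only, not derived from the dataset.
POSITIVE = {"great", "excellent", "loved", "wonderful"}
NEGATIVE = {"boring", "awful", "hated", "terrible"}

def predict_sentiment(text: str) -> int:
    """Return 1 (positive) or 0 (negative) by counting cue words."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return 1 if score >= 0 else 0

review = {"text": "I loved this film, the acting was excellent", "label": 1}
print(predict_sentiment(review["text"]))  # 1
```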
2. Amazon Polarity Dataset
The Amazon Polarity dataset was built from a corpus of more than 35 million Amazon product reviews. Each data point pairs the text of a customer review with a label that is positive or negative, depending on the customer’s opinion of the product.
This labeled dataset can be used for NLP and machine-learning tasks. Companies can sharpen their marketing and advertising by using the Amazon Polarity dataset: NLP techniques can reveal which products customers like and which features drive a purchase.
Similar datasets include the Yelp full review dataset, which contains a large number of reviews categorized by star rating (1–5). As with the Amazon dataset, NLP can help a restaurant or service provider market its products.
The Yelp review and Amazon Polarity datasets can also be used in recommendation systems, separating products or businesses into different types so that an app or website can help customers filter by their preferences.
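Polarity labels in datasets like these are typically derived from star ratings: low ratings become negative, high ratings become positive, and neutral reviews are dropped. The thresholds and the `rating_to_polarity` helper below are illustrative assumptions, not the dataset's exact recipe.

```python
# Sketch: deriving a polarity label from a 1-5 star rating, in the style
# commonly used to build review-polarity datasets. Thresholds are assumed.
def rating_to_polarity(stars):
    if stars <= 2:
        return "negative"
    if stars >= 4:
        return "positive"
    return None  # neutral reviews are excluded

reviews = [(5, "Works perfectly"), (1, "Broke after a day"), (3, "It's okay")]
labels = [(rating_to_polarity(s), text) for s, text in reviews]
print(labels)
```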
3. Emotion Dataset
The emotion dataset classifies English Twitter messages into six categories: sadness, joy, love, anger, fear, and surprise.
This dataset can be used to train and test an NLP model that infers a user’s emotion from a passage of their text. The anger and sadness categories can also help detect and filter out abusive or discouraging messages.
Similar Twitter-based datasets exist. The tweet eval dataset, for example, includes a task that maps users’ tweets to the emoji that best represents them, covering happiness, love, laughter, and more.
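Labels in datasets like this are usually stored as integer ids rather than strings. The decoder below assumes the six class names from the emotion dataset's card, in an assumed ordering; treat the exact mapping as illustrative.

```python
# Decoding integer class ids back to emotion names. The ordering below is an
# assumption for this sketch; check the dataset card for the authoritative one.
EMOTIONS = ["sadness", "joy", "love", "anger", "fear", "surprise"]

def decode_label(label_id: int) -> str:
    return EMOTIONS[label_id]

example = {"text": "i feel like celebrating today", "label": 1}
print(decode_label(example["label"]))  # joy
```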
4. Common Voice Dataset
This dataset includes both audio and text data points. The Common Voice dataset contains more than 9,000 hours of recorded speech along with written transcripts. To improve a model’s voice-recognition performance, additional metadata such as the speaker’s age, gender, and accent is also available.
This data can be used to build and improve the accuracy of a voice-recognition model that understands over 60 languages from around the globe. Voice-recognition models are becoming common in mainstream technology like Siri, Google Home, and Alexa, and all of these systems must understand many different voices.
5. Silicone Dataset
This dataset labels utterances by dialogue act, such as commissive, directive, informative, or question. Silicone covers many domains, including telephone conversations and television dialogue. All data points are in English.
This data can be used to train and evaluate natural-language models, particularly systems designed for understanding spoken language.
6. Yahoo Answers Topics Dataset
The Yahoo Answers dataset contains a large number of questions and answers. Each data point (question and its answers) is assigned to a specific topic, such as business & finance, society & culture, science & mathematics, family & relationships, computers & internet, sports, and more.
This data can be used to train a model that assigns questions and answers to one of these categories.
7. Hate Speech Dataset
The hate speech dataset includes a selection of text messages taken from the Stormfront forum. Depending on its content, each data point is assigned a hate or non-hate label. As the name suggests, this dataset can be used to train models that detect hate speech on online forums.
Similar content can be found in the hate speech offensive dataset. This data can be used to train a model that filters out certain words, banning them from forum posts, games with young audiences, and search-bar queries.
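The word-filtering idea described above can be sketched in a few lines. The `BLOCKLIST` terms and the `mask_blocked` helper are placeholders invented for this sketch; a real system would pair a classifier trained on labeled hate-speech data with such a filter.

```python
# Minimal word-filter sketch: mask any term on a blocklist. The blocklist
# entries here are placeholders, not terms from the dataset.
import re

BLOCKLIST = {"badword", "slur"}  # placeholder terms

def mask_blocked(text: str) -> str:
    def repl(match):
        word = match.group(0)
        return "*" * len(word) if word.lower() in BLOCKLIST else word
    return re.sub(r"[A-Za-z]+", repl, text)

print(mask_blocked("That badword is not allowed"))  # That ******* is not allowed
```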
8. Scan Dataset
The SCAN dataset is a set of language-driven navigation tasks for studying compositional learning and zero-shot generalization.
A data point in the SCAN dataset pairs a simple command, such as “walk left twice,” with the action sequence a model is expected to produce: turn left and walk, then turn left and walk again.
9. SMS Spam Dataset
The SMS spam dataset contains more than 5,000 English SMS messages, each labeled as spam or non-spam.
NLP is widely used to filter spam messages. A labeled spam dataset like this one can be used to train an e-mail filtering program, or any other system that requires spam filtering.
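A labeled corpus like this is enough to train a simple classifier. Below is a minimal naive Bayes spam filter; the four-message training set is invented for the sketch and stands in for the real data.

```python
# Minimal naive Bayes spam filter. The tiny training set is invented;
# in practice you would train on thousands of labeled SMS messages.
import math
from collections import Counter

train = [
    ("win a free prize now", "spam"),
    ("free cash claim now", "spam"),
    ("are we meeting for lunch", "ham"),
    ("see you at home tonight", "ham"),
]

counts = {"spam": Counter(), "ham": Counter()}
totals = {"spam": 0, "ham": 0}
for text, label in train:
    for word in text.split():
        counts[label][word] += 1
        totals[label] += 1

vocab = set(counts["spam"]) | set(counts["ham"])

def classify(text):
    scores = {}
    for label in ("spam", "ham"):
        # Class priors are equal here (two messages each), so they cancel.
        score = 0.0
        for word in text.split():
            # Laplace smoothing so unseen words don't zero out the product.
            score += math.log((counts[label][word] + 1) / (totals[label] + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("claim your free prize"))  # spam
```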
10. Banking 77 Dataset
The Banking 77 dataset is more complex: it includes over 13,000 customer messages (complaints or queries) sent to banks.
Each data point is assigned one of seventy-seven intents, such as an inquiry about card delivery, a card problem, an extra charge on a card, or a declined transfer.
This type of data enables banks to respond quickly and to organize customer issues in a more structured way for future use. Similar models can be built for any business that receives many customer inquiries daily; to train such a model, you will need a well-filtered and processed dataset.
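As a toy stand-in for a trained intent classifier, the keyword router below matches a message against a few made-up intent keyword sets. The intent names are paraphrased for the sketch, not Banking77's exact label strings.

```python
# Illustrative intent router over a handful of bank-style intents.
# Keyword sets and intent names are invented; a production system would
# use a classifier trained on the labeled messages instead.
INTENT_KEYWORDS = {
    "card_arrival": {"card", "delivery", "arrive"},
    "declined_transfer": {"transfer", "declined"},
    "extra_charge": {"charge", "extra", "fee"},
}

def route(message):
    words = set(message.lower().split())
    best = max(INTENT_KEYWORDS, key=lambda intent: len(INTENT_KEYWORDS[intent] & words))
    return best if INTENT_KEYWORDS[best] & words else None

print(route("my transfer was declined again"))  # declined_transfer
```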
Other Hugging Face Datasets
Hugging Face hosts many more datasets; here are three additional ones worth exploring.
1. LIAR Dataset
The LIAR dataset includes more than 12,000 labeled statements made by politicians and other public figures.
Each statement is rated on a truthfulness scale that runs from false through partially true and mostly true up to true.
A machine-learning model built on the LIAR dataset could estimate the trustworthiness of future statements.
2. Google Well-Formed Query Dataset
This Google query dataset was created by crowdsourcing annotations for 25,100 queries from the Paralex corpus. It labels each data point according to how well-formed it is.
Each query was annotated by five raters as being either well-formed or not.
A machine-learning model trained on this data can predict how well-formed a new query is.
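Multiple binary annotator votes are often reduced to a single training label. The sketch below averages five ratings into a score and thresholds it; the 0.8 cutoff and the helper names are assumptions for illustration, not necessarily the dataset's exact procedure.

```python
# Sketch: turning five binary annotator votes into a well-formedness label.
# The 0.8 threshold is an assumed cutoff for this illustration.
def wellformed_score(ratings):
    return sum(ratings) / len(ratings)

def is_wellformed(ratings, threshold=0.8):
    return wellformed_score(ratings) >= threshold

print(is_wellformed([1, 1, 1, 1, 0]))  # True  (score 0.8)
print(is_wellformed([1, 0, 0, 1, 0]))  # False (score 0.4)
```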
3. Jfleg Dataset
The JFLEG dataset is widely accepted as a benchmark for English grammatical error correction. Each data point includes a written sentence containing spelling and grammar mistakes, along with four corrected versions written by four different annotators.
This dataset can be used to train a model that detects and corrects grammar errors. Like most machine-learning models, it will not guarantee 100% correct spelling and grammar. A model’s performance depends on the task at hand (spam filtering, hate-speech detection, review classification), so choosing the right dataset has a significant impact.
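Correction systems on JFLEG are scored against the human references (the benchmark's standard metric is GLEU). The simplified nearest-reference word-overlap score below is only a stand-in to show the idea, with invented example sentences and helper names.

```python
# Simplified reference-based scoring in the JFLEG style: a candidate
# correction is compared against several human references and takes the best
# score. This word-position overlap is a toy stand-in for the GLEU metric.
def overlap(candidate, reference):
    a, b = candidate.split(), reference.split()
    same = sum(1 for x, y in zip(a, b) if x == y)
    return same / max(len(a), len(b))

def score(candidate, references):
    return max(overlap(candidate, ref) for ref in references)

references = [
    "She went to the store yesterday .",
    "She went to the shop yesterday .",
]
print(score("She went to the store yesterday .", references))  # 1.0
```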
You can test your model’s performance by running it on some of the datasets mentioned above, or search the internet for other datasets to compare with those listed here.
Using Hugging Face Datasets
NLP has many uses: organizing text into categories (for recommendation systems), detecting hate speech, and filtering out spam e-mails. Working with NLP is a skill worth learning.
This article focused on Hugging Face, an open-source platform that hosts a large number of NLP datasets and is primarily dedicated to NLP machine-learning models. We also covered a selection of datasets that can help advance your machine-learning career.
Try out the examples and learn how to use these datasets with your machine-learning model. Hugging Face and other sites offer many more datasets to meet your model’s needs.