
MR (MR Movie Reviews)


MR Movie Reviews is a dataset for use in sentiment-analysis experiments. Available are collections of movie-review documents labeled with respect to their overall sentiment polarity (positive or negative) or subjective rating (e.g., "two and a half stars") and sentences labeled with respect to their subjectivity status (subjective or objective) or polarity.


IMDB Large Movie Review Dataset

The core dataset contains 50,000 reviews split evenly into 25k train and 25k test sets. The overall distribution of labels is balanced (25k pos and 25k neg).

http://ai.stanford.edu/~amaas/data/sentiment/

The R textdata package provides this dataset through the dataset_imdb() function, which takes the following arguments:

  • dir: Character, path to directory where data will be stored. If NULL, user_cache_dir will be used to determine the path.
  • split: Character. Return training ("train") data or testing ("test") data. Defaults to "train".
  • delete: Logical, set TRUE to delete the dataset.
  • return_path: Logical, set TRUE to return the path of the dataset.
  • clean: Logical, set TRUE to remove intermediate files. This can greatly reduce the size. Defaults to FALSE.
  • manual_download: Logical, set TRUE if you have manually downloaded the file and placed it in the folder designated by running this function with return_path = TRUE.

Value: a tibble with 25,000 rows and 2 variables:

  • sentiment: Character, denoting the sentiment
  • text: Character, text of the review

In the entire collection, no more than 30 reviews are allowed for any given movie because reviews for the same movie tend to have correlated ratings. Further, the train and test sets contain a disjoint set of movies, so no significant performance is obtained by memorizing movie-unique terms and their association with observed labels. In the labeled train/test sets, a negative review has a score <= 4 out of 10, and a positive review has a score >= 7 out of 10. Thus reviews with more neutral ratings are not included in the train/test sets. In the unsupervised set, reviews of any rating are included and there are an even number of reviews > 5 and <= 5.

When using this dataset, please cite the ACL 2011 paper

@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}

Use Sentiment Analysis With Python to Classify Movie Reviews

Table of Contents

  • Removing Stop Words
  • Normalizing Words
  • Vectorizing Text
  • Machine Learning Tools
  • How Classification Works
  • How to Use spaCy for Text Classification
  • Loading and Preprocessing Data
  • Training Your Classifier
  • Classifying Reviews
  • Connecting the Pipeline
  • Next Steps With Sentiment Analysis and Python

Sentiment analysis is a powerful tool that allows computers to understand the underlying subjective tone of a piece of writing. This is something that humans have difficulty with, and as you might imagine, it isn’t always so easy for computers, either. But with the right tools and Python, you can use sentiment analysis to better understand the sentiment of a piece of writing.

Why would you want to do that? There are a lot of uses for sentiment analysis, such as understanding how stock traders feel about a particular company by using social media data or aggregating reviews, which you’ll get to do by the end of this tutorial.

In this tutorial, you’ll learn:

  • How to use natural language processing (NLP) techniques
  • How to use machine learning to determine the sentiment of text
  • How to use spaCy to build an NLP pipeline that feeds into a sentiment analysis classifier

This tutorial is ideal for beginning machine learning practitioners who want a project-focused guide to building sentiment analysis pipelines with spaCy.

You should be familiar with basic machine learning techniques like binary classification as well as the concepts behind them, such as training loops, data batches, and weights and biases. If you’re unfamiliar with machine learning, then you can kickstart your journey by learning about logistic regression .

When you’re ready, you can follow along with the examples in this tutorial by downloading the source code from the link below:


Using Natural Language Processing to Preprocess and Clean Text Data

Any sentiment analysis workflow begins with loading data. But what do you do once the data’s been loaded? You need to process it through a natural language processing pipeline before you can do anything interesting with it.

The necessary steps include (but aren’t limited to) the following:

  • Tokenizing sentences to break text down into sentences, words, or other units
  • Removing stop words like “if,” “but,” “or,” and so on
  • Normalizing words by condensing all forms of a word into a single form
  • Vectorizing text by turning the text into a numerical representation for consumption by your classifier

All these steps serve to reduce the noise inherent in any human-readable text and improve the accuracy of your classifier’s results. There are lots of great tools to help with this, such as the Natural Language Toolkit , TextBlob , and spaCy . For this tutorial, you’ll use spaCy.

Note: spaCy is a very powerful tool with many features. For a deep dive into many of these features, check out Natural Language Processing With spaCy .

Before you go further, make sure you have spaCy and its English model installed:
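The install commands themselves were lost in extraction; for spaCy 2.x they would look roughly like this (the exact version pinned by the tutorial may differ):

```console
$ pip install spacy==2.3.5
$ python -m spacy download en_core_web_sm
```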

The first command installs spaCy, and the second uses spaCy to download its English language model. spaCy supports a number of different languages, which are listed on the spaCy website .

Warning: This tutorial only works with spaCy 2.X and is not compatible with spaCy 3.0. For the best experience, please install the version specified above.

Next, you’ll learn how to use spaCy to help with the preprocessing steps you learned about earlier, starting with tokenization.

Tokenization is the process of breaking down chunks of text into smaller pieces. spaCy comes with a default processing pipeline that begins with tokenization, making this process a snap. In spaCy, you can do either sentence tokenization or word tokenization:

  • Word tokenization breaks text down into individual words.
  • Sentence tokenization breaks text down into individual sentences.

In this tutorial, you’ll use word tokenization to separate the text into individual words. First, you’ll load the text into spaCy, which does the work of tokenization for you:
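The original code listing is not reproduced here; a minimal sketch, using a stand-in for the tutorial's example text (which features a character named Dave), looks like this:

```python
import spacy

# Stand-in example text; the tutorial's original passage is longer.
text = (
    "Dave watched as the forest burned up on the hill, "
    "only a few miles from his house."
)

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)  # tokenization happens inside this call

token_list = [token for token in doc]
print(token_list)
```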

In this code, you set up some example text to tokenize, load spaCy’s English model, and then tokenize the text by passing it into the nlp constructor. This model includes a default processing pipeline that you can customize, as you’ll see later in the project section.

After that, you generate a list of tokens and print it. As you may have noticed, “word tokenization” is a slightly misleading term, as captured tokens include punctuation and other nonword strings.

Tokens are an important container type in spaCy and have a very rich set of features. In the next section, you’ll learn how to use one of those features to filter out stop words.

Stop words are words that may be important in human communication but are of little value for machines. spaCy comes with a default list of stop words that you can customize. For now, you’ll see how you can use token attributes to remove stop words:
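A sketch of that one-liner, continuing from the doc created above:

```python
# Keep only the tokens that spaCy does not flag as stop words
filtered_tokens = [token for token in doc if not token.is_stop]
print(filtered_tokens)
```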

In one line of Python code, you filter out stop words from the tokenized text using the .is_stop token attribute.

What differences do you notice between this output and the output you got after tokenizing the text? With the stop words removed, the token list is much shorter, and there’s less context to help you understand the tokens.

Normalization is a little more complex than tokenization. It entails condensing all forms of a word into a single representation of that word. For instance, “watched,” “watching,” and “watches” can all be normalized into “watch.” There are two major normalization methods:

  • Stemming
  • Lemmatization

With stemming , a word is cut off at its stem , the smallest unit of that word from which you can create the descendant words. You just saw an example of this above with “watch.” Stemming simply truncates the string using common endings, so it will miss the relationship between “feel” and “felt,” for example.

Lemmatization seeks to address this issue. This process uses a data structure that relates all forms of a word back to its simplest form, or lemma . Because lemmatization is generally more powerful than stemming, it’s the only normalization strategy offered by spaCy.

Luckily, you don’t need any additional code to do this. It happens automatically—along with a number of other activities, such as part of speech tagging and named entity recognition —when you call nlp() . You can inspect the lemma for each token by taking advantage of the .lemma_ attribute:
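A sketch of that inspection, continuing from the filtered_tokens list:

```python
# Pair each token with its lemma for a readable printout
lemmas = [f"Token: {token}, lemma: {token.lemma_}" for token in filtered_tokens]
print(lemmas)
```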

All you did here was generate a readable list of tokens and lemmas by iterating through the filtered list of tokens, taking advantage of the .lemma_ attribute to inspect the lemmas. This example shows only the first few tokens and lemmas. Your output will be much longer.

Note: Notice the underscore on the .lemma_ attribute. That’s not a typo. It’s a convention in spaCy that gets the human-readable version of the attribute .

The next step is to represent each token in a way that a machine can understand. This is called vectorization .

Vectorization is a process that transforms a token into a vector , or a numeric array that, in the context of NLP, is unique to and represents various features of a token. Vectors are used under the hood to find word similarities, classify text, and perform other NLP operations.

This particular representation is a dense array , one in which there are defined values for every space in the array. This is in opposition to earlier methods that used sparse arrays , in which most spaces are empty.

Like the other steps, vectorization is taken care of automatically with the nlp() call. Since you already have a list of token objects, you can get the vector representation of one of the tokens like so:
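For example (with the tutorial's original text, the second filtered token is the word Dave; with the stand-in text above, it may be a different word):

```python
# Dense vector representation of the second token in the filtered list
print(filtered_tokens[1].vector)
```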

Here you use the .vector attribute on the second token in the filtered_tokens list, which in this set of examples is the word Dave .

Note: If you get different results for the .vector attribute, don’t worry. This could be because you’re using a different version of the en_core_web_sm model or, potentially, of spaCy itself.

Now that you’ve learned about some of the typical text preprocessing steps in spaCy, you’ll learn how to classify text.

Using Machine Learning Classifiers to Predict Sentiment

Your text is now processed into a form understandable by your computer, so you can start to work on classifying it according to its sentiment. You’ll cover three topics that will give you a general understanding of machine learning classification of text data:

  • What machine learning tools are available and how they’re used
  • How classification works
  • How to use spaCy for text classification

First, you’ll learn about some of the available tools for doing machine learning classification.

There are a number of tools available in Python for solving classification problems. Here are some of the more popular ones:

  • TensorFlow
  • PyTorch
  • scikit-learn

This list isn’t all-inclusive, but these are the more widely used machine learning frameworks available in Python. They’re large, powerful frameworks that take a lot of time to truly master and understand.

TensorFlow is developed by Google and is one of the most popular machine learning frameworks. You use it primarily to implement your own machine learning algorithms as opposed to using existing algorithms. It’s fairly low-level, which gives the user a lot of power, but it comes with a steep learning curve.

PyTorch is Facebook’s answer to TensorFlow and accomplishes many of the same goals. However, it’s built to be more familiar to Python programmers and has become a very popular framework in its own right. Because they have similar use cases, comparing TensorFlow and PyTorch is a useful exercise if you’re considering learning a framework.

scikit-learn stands in contrast to TensorFlow and PyTorch. It’s higher-level and allows you to use off-the-shelf machine learning algorithms rather than building your own. What it lacks in customizability, it more than makes up for in ease of use, allowing you to quickly train classifiers in just a few lines of code.

Luckily, spaCy provides a fairly straightforward built-in text classifier that you’ll learn about a little later. First, however, it’s important to understand the general workflow for any sort of classification problem.

Don’t worry—for this section you won’t go deep into linear algebra , vector spaces, or other esoteric concepts that power machine learning in general. Instead, you’ll get a practical introduction to the workflow and constraints common to classification problems.

Once you have your vectorized data, a basic workflow for classification looks like this:

  • Split your data into training and evaluation sets.
  • Select a model architecture.
  • Use training data to train your model.
  • Use test data to evaluate the performance of your model.
  • Use your trained model on new data to generate predictions, which in this case will be a number between -1.0 and 1.0.

This list isn’t exhaustive, and there are a number of additional steps and variations that can be done in an attempt to improve accuracy. For example, machine learning practitioners often split their datasets into three sets:

  • Training
  • Validation
  • Test

The training set , as the name implies, is used to train your model. The validation set is used to help tune the hyperparameters of your model, which can lead to better performance.

Note: Hyperparameters control the training process and structure of your model and can include things like learning rate and batch size. However, which hyperparameters are available depends very much on the model you choose to use.

The test set is a dataset that incorporates a wide variety of data to accurately judge the performance of the model. Test sets are often used to compare multiple models, including the same models at different stages of training.

Now that you’ve learned the general flow of classification, it’s time to put it into action with spaCy.

You’ve already learned how spaCy does much of the text preprocessing work for you with the nlp() constructor. This is really helpful since training a classification model requires many examples to be useful.

Additionally, spaCy provides a pipeline functionality that powers much of the magic that happens under the hood when you call nlp() . The default pipeline is defined in a JSON file associated with whichever preexisting model you’re using ( en_core_web_sm for this tutorial), but you can also build one from scratch if you wish.

Note: To learn more about creating your own language processing pipelines, check out the spaCy pipeline documentation .

What does this have to do with classification? One of the built-in pipeline components that spaCy provides is called textcat (short for TextCategorizer ), which enables you to assign categories (or labels ) to your text data and use that as training data for a neural network.

This process will generate a trained model that you can then use to predict the sentiment of a given piece of text. To take advantage of this tool, you’ll need to do the following steps:

  • Add the textcat component to the existing pipeline.
  • Add valid labels to the textcat component.
  • Load, shuffle, and split your data.
  • Train the model, evaluating on each training loop.
  • Use the trained model to predict the sentiment of non-training data.
  • Optionally, save the trained model.

Note: You can see an implementation of these steps in the spaCy documentation examples . This is the main way to classify text in spaCy, so you’ll notice that the project code draws heavily from this example.

In the next section, you’ll learn how to put all these pieces together by building your own project: a movie review sentiment analyzer.

Building Your Own NLP Sentiment Analyzer

From the previous sections, you’ve probably noticed four major stages of building a sentiment analysis pipeline:

  • Loading data
  • Preprocessing
  • Training the classifier
  • Classifying data

For building a real-life sentiment analyzer, you’ll work through each of the steps that compose these stages. You’ll use the Large Movie Review Dataset compiled by Andrew Maas to train and test your sentiment analyzer. Once you’re ready, proceed to the next section to load your data.

If you haven’t already, download and extract the Large Movie Review Dataset. Spend a few minutes poking around, taking a look at its structure, and sampling some of the data. This will inform how you load the data. For this part, you’ll use spaCy’s textcat example as a rough guide.

You can (and should) decompose the loading stage into concrete steps to help plan your coding. Here’s an example:

  • Load text and labels from the file and directory structures.
  • Shuffle the data.
  • Split the data into training and test sets.
  • Return the two sets of data.

This process is relatively self-contained, so it should be its own function at least. In thinking about the actions that this function would perform, you may have thought of some possible parameters.

Since you’re splitting data, the ability to control the size of those splits may be useful, so split is a good parameter to include. You may also wish to limit the total amount of documents you process with a limit parameter. You can open your favorite editor and add this function signature:
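The signature below is a sketch; the function and parameter names (load_training_data, data_directory, split, limit) follow the tutorial's description but are otherwise assumptions:

```python
import os
import random

def load_training_data(
    data_directory: str = "aclImdb/train",
    split: float = 0.8,
    limit: int = 0,
) -> tuple:
    ...
```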

With this signature, you take advantage of Python 3’s type annotations to make it absolutely clear which types your function expects and what it will return.

The parameters here allow you to define the directory in which your data is stored as well as the ratio of training data to test data. A good ratio to start with is 80 percent of the data for training data and 20 percent for test data. All of this and the following code, unless otherwise specified, should live in the same file.

Next, you’ll want to iterate through all the files in this dataset and load them into a list:
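A sketch of this loop, filling in the body of load_training_data() and following the dataset's pos/neg directory layout:

```python
    # Inside load_training_data()
    reviews = []
    for label in ["pos", "neg"]:
        labeled_directory = f"{data_directory}/{label}"
        for review in os.listdir(labeled_directory):
            if review.endswith(".txt"):
                with open(f"{labeled_directory}/{review}") as f:
                    text = f.read()
                    text = text.replace("<br />", "\n\n").strip()
                    if text:
                        # Label dictionary in the format spaCy's textcat expects
                        spacy_label = {
                            "cats": {
                                "pos": label == "pos",
                                "neg": label == "neg",
                            }
                        }
                        reviews.append((text, spacy_label))
```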

While this may seem complicated, what you’re doing is constructing the directory structure of the data, looking for and opening text files, then appending a tuple of the contents and a label dictionary to the reviews list.

The label dictionary structure is a format required by the spaCy model during the training loop, which you’ll see soon.

Note: Throughout this tutorial and throughout your Python journey, you’ll be reading and writing files . This is a foundational skill to master, so make sure to review it while you work through this tutorial.

Since you have each review open at this point, it’s a good idea to replace the <br /> HTML tags in the texts with newlines and to use .strip() to remove all leading and trailing whitespace.

For this project, you won’t remove stop words from your training data right away because it could change the meaning of a sentence or phrase, which could reduce the predictive power of your classifier. This is dependent somewhat on the stop word list that you use.

After loading the files, you want to shuffle them. This works to eliminate any possible bias from the order in which training data is loaded. Since the random module makes this easy to do in one line, you’ll also see how to split your shuffled data:
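A sketch of the shuffle-and-split step that ends load_training_data():

```python
    # Still inside load_training_data()
    random.shuffle(reviews)
    if limit:
        reviews = reviews[:limit]
    split_boundary = int(len(reviews) * split)
    return reviews[:split_boundary], reviews[split_boundary:]
```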

Here, you shuffle your data with a call to random.shuffle() . Then you optionally truncate and split the data using some math to convert the split to a number of items that define the split boundary. Finally, you return two parts of the reviews list using list slices.

Here’s a sample output, truncated for brevity:

To learn more about how random works, take a look at Generating Random Data in Python (Guide) .

Note: The makers of spaCy have also released a package called thinc that, among other features, includes simplified access to large datasets, including the IMDB review dataset you’re using for this project.

You can find the project on GitHub . If you investigate it, look at how they handle loading the IMDB dataset and see what overlaps exist between their code and your own.

Now that you’ve got your data loader built and have some light preprocessing done, it’s time to build the spaCy pipeline and classifier training loop.

Putting the spaCy pipeline together allows you to rapidly build and train a convolutional neural network (CNN) for classifying text data. While you’re using it here for sentiment analysis, it’s general enough to work with any kind of text classification task as long as you provide it with the training data and labels.

In this part of the project, you’ll take care of three steps:

  • Modifying the base spaCy pipeline to include the textcat component
  • Building a training loop to train the textcat component
  • Evaluating the progress of your model training after a given number of training loops

First, you’ll add textcat to the default spaCy pipeline.

Modifying the spaCy Pipeline to Include textcat

For the first part, you’ll load the same pipeline as you did in the examples at the beginning of this tutorial, then you’ll add the textcat component if it isn’t already present. After that, you’ll add the labels that your data uses ( "pos" for positive and "neg" for negative) to textcat . Once that’s done, you’ll be ready to build the training loop:
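A sketch under the spaCy 2.x API; the "simple_cnn" architecture is one of the documented textcat options and is assumed here:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
if "textcat" not in nlp.pipe_names:
    textcat = nlp.create_pipe(
        "textcat", config={"architecture": "simple_cnn"}
    )
    nlp.add_pipe(textcat, last=True)
```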

If you’ve looked at the spaCy documentation’s textcat example already, then this should look pretty familiar. First, you load the built-in en_core_web_sm pipeline, then you check the .pipe_names attribute to see if the textcat component is already available.

If it isn’t, then you create the component (also called a pipe ) with .create_pipe() , passing in a configuration dictionary. There are a few options that you can work with, described in the TextCategorizer documentation .

Finally, you add the component to the pipeline using .add_pipe() , with the last parameter signifying that this component should be added to the end of the pipeline.

Next, you’ll handle the case in which the textcat component is present and then add the labels that will serve as the categories for your text:
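Continuing the sketch:

```python
else:
    textcat = nlp.get_pipe("textcat")

# The categories textcat will learn to assign
textcat.add_label("pos")
textcat.add_label("neg")
```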

If the component is present in the loaded pipeline, then you just use .get_pipe() to assign it to a variable so you can work on it. For this project, all that you’ll be doing with it is adding the labels from your data so that textcat knows what to look for. You’ll do that with .add_label() .

You’ve created the pipeline and prepared the textcat component for the labels it will use for training. Now it’s time to write the training loop that will allow textcat to categorize movie reviews.

Build Your Training Loop to Train textcat

To begin the training loop, you’ll first set your pipeline to train only the textcat component, generate batches of data for it with spaCy’s minibatch() and compounding() utilities, and then go through them and update your model.

A batch is just a subset of your data. Batching your data allows you to reduce the memory footprint during training and update your model's parameters more frequently.

Note: Compounding batch sizes is a relatively new technique and should help speed up training. You can learn more about compounding batch sizes in spaCy’s training tips .

Here’s an implementation of the training loop described above:
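A sketch with assumed names (train_model, training_data, test_data, iterations), building on the pipeline constructed above:

```python
def train_model(
    training_data: list,
    test_data: list,
    iterations: int = 20,
) -> None:
    # Pipeline construction as shown earlier (textcat added and labeled)
    # Exclude every component except textcat while training
    training_excluded_pipes = [
        pipe for pipe in nlp.pipe_names if pipe != "textcat"
    ]
    with nlp.disable_pipes(*training_excluded_pipes):
        ...
```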

First, you create a list of all components in the pipeline that aren’t the textcat component. You then use the nlp.disable_pipes() context manager to disable those components for all code within the context manager’s scope.

Now you’re ready to add the code to begin training:
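Continuing the sketch (this fragment requires from spacy.util import minibatch, compounding at the top of the file):

```python
        # Inside train_model(), within the disable_pipes() context
        optimizer = nlp.begin_training()
        # Infinite series of batch sizes, growing from 4 toward 32
        batch_sizes = compounding(4.0, 32.0, 1.001)
```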

Here, you call nlp.begin_training() , which returns the initial optimizer function. This is what nlp.update() will use to update the weights of the underlying model.

You then use the compounding() utility to create a generator, giving you an infinite series of batch_sizes that will be used later by the minibatch() utility.

Now you’ll begin training on batches of data:
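A sketch of the loop itself:

```python
        # Still inside the disable_pipes() context
        for i in range(iterations):
            loss = {}
            random.shuffle(training_data)
            batches = minibatch(training_data, size=batch_sizes)
            for batch in batches:
                text, labels = zip(*batch)
                # drop=0.2 skips 20% of each batch (dropout)
                nlp.update(text, labels, drop=0.2, sgd=optimizer, losses=loss)
```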

Now, for each iteration that is specified in the train_model() signature, you create an empty dictionary called loss that will be updated and used by nlp.update() . You also shuffle the training data and split it into batches of varying size with minibatch() .

For each batch, you separate the text and labels, then feed them, the empty loss dictionary, and the optimizer to nlp.update() . This runs the actual training on each example.

The dropout parameter tells nlp.update() what proportion of the training data in that batch to skip over. You do this to make it harder for the model to accidentally just memorize training data without coming up with a generalizable model.

This will take some time, so it’s important to periodically evaluate your model. You’ll do that with the data that you held back from the training set, also known as the holdout set .

Evaluating the Progress of Model Training

Since you’ll be doing a number of evaluations, with many calculations for each one, it makes sense to write a separate evaluate_model() function. In this function, you’ll run the documents in your test set against the unfinished model to get your model’s predictions and then compare them to the correct labels of that data.

Using that information, you’ll calculate the following values:

True positives are documents that your model correctly predicted as positive. For this project, this maps to the positive sentiment but generalizes in binary classification tasks to the class you’re trying to identify.

False positives are documents that your model incorrectly predicted as positive but were in fact negative.

True negatives are documents that your model correctly predicted as negative.

False negatives are documents that your model incorrectly predicted as negative but were in fact positive.

Because your model will return a score between 0 and 1 for each label, you’ll determine a positive or negative result based on that score. From the four statistics described above, you’ll calculate precision and recall, which are common measures of classification model performance:

Precision is the ratio of true positives to all items your model marked as positive (true and false positives). A precision of 1.0 means that every review that your model marked as positive belongs to the positive class.

Recall is the ratio of true positives to all reviews that are actually positive, or the number of true positives divided by the total number of true positives and false negatives.

The F-score is another popular accuracy measure, especially in the world of NLP. Explaining it could take its own article, but you’ll see the calculation in the code. As with precision and recall, the score ranges from 0 to 1, with 1 signifying the highest performance and 0 the lowest.

For evaluate_model() , you’ll need to pass in the pipeline’s tokenizer component, the textcat component, and your test dataset:
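A sketch of evaluate_model(), using a 0.5 decision threshold on the "pos" score:

```python
def evaluate_model(tokenizer, textcat, test_data: list) -> dict:
    reviews, labels = zip(*test_data)
    # Generator expression: tokenize lazily instead of all at once
    reviews = (tokenizer(review) for review in reviews)
    true_positives = 0
    false_positives = 1e-8  # can't be 0 because of division below
    true_negatives = 0
    false_negatives = 1e-8
    for i, review in enumerate(textcat.pipe(reviews)):
        true_label = labels[i]["cats"]
        score = review.cats["pos"]
        if true_label["pos"]:
            if score >= 0.5:
                true_positives += 1
            else:
                false_negatives += 1
        else:
            if score >= 0.5:
                false_positives += 1
            else:
                true_negatives += 1
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    if precision + recall == 0:
        f_score = 0.0
    else:
        f_score = 2 * (precision * recall) / (precision + recall)
    return {"precision": precision, "recall": recall, "f-score": f_score}
```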

In this function, you separate reviews and their labels and then use a generator expression to tokenize each of your evaluation reviews, preparing them to be passed in to textcat . The generator expression is a nice trick recommended in the spaCy documentation that allows you to iterate through your tokenized reviews without keeping every one of them in memory.

You then use the score and true_label to determine true or false positives and true or false negatives. You then use those to calculate precision, recall, and f-score. Now all that’s left is to actually call evaluate_model() :
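A sketch of that call, placed at the end of each training iteration (a one-time header such as print("Loss\tPrecision\tRecall\tF-score") before the loop helps organize the output):

```python
            # At the end of each iteration, inside train_model()
            with textcat.model.use_params(optimizer.averages):
                evaluation_results = evaluate_model(
                    tokenizer=nlp.tokenizer,
                    textcat=textcat,
                    test_data=test_data,
                )
                print(
                    f"{loss['textcat']:.3f}\t"
                    f"{evaluation_results['precision']:.3f}\t"
                    f"{evaluation_results['recall']:.3f}\t"
                    f"{evaluation_results['f-score']:.3f}"
                )
```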

Here you add a print statement to help organize the output from evaluate_model() , then call evaluate_model() within the .use_params() context manager in order to use the model in its current state, and print the results.

Once the training process is complete, it’s a good idea to save the model you just trained so that you can use it again without training a new model. After your training loop, add this code to save the trained model to a directory called model_artifacts located within your working directory:
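A sketch of the save step:

```python
    # After the training loop, still inside train_model()
    with nlp.use_params(optimizer.averages):
        nlp.to_disk("model_artifacts")
```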

This snippet saves your model to a directory called model_artifacts so that you can make tweaks without retraining the model. Your final training function should look like this:

In this section, you learned about training a model and evaluating its performance as you train it. You then built a function that trains a classification model on your input data.

Now that you have a trained model, it’s time to test it against a real review. For the purposes of this project, you’ll hardcode a review, but you should certainly try extending this project by reading reviews from other sources, such as files or a review aggregator’s API.

The first step with this new function will be to load the previously saved model. While you could use the model in memory, loading the saved model artifact allows you to optionally skip training altogether, which you’ll see later. Here’s the test_model() signature along with the code to load your saved model:
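A sketch (TEST_REVIEW is the constant described below):

```python
def test_model(input_data: str = TEST_REVIEW):
    # Load the previously saved trained model from disk
    loaded_model = spacy.load("model_artifacts")
```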

In this code, you define test_model() , which includes the input_data parameter. You then load your previously saved model.

The IMDB data you’re working with includes an unsup directory within the training data directory that contains unlabeled reviews you can use to test your model. Here’s one such review. You should save it (or a different one of your choosing) in a TEST_REVIEW constant at the top of your file:

Next, you’ll pass this review into your model to generate a prediction, prepare it for display, and then display it to the user:
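Continuing the sketch:

```python
    # Inside test_model()
    parsed_text = loaded_model(input_data)
    # Determine the prediction from the two category scores
    if parsed_text.cats["pos"] > parsed_text.cats["neg"]:
        prediction = "Positive"
        score = parsed_text.cats["pos"]
    else:
        prediction = "Negative"
        score = parsed_text.cats["neg"]
    print(
        f"Review text: {input_data}\n"
        f"Predicted sentiment: {prediction}\tScore: {score:.3f}"
    )
```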

In this code, you pass your input_data into your loaded_model , which generates a prediction in the cats attribute of the parsed_text variable. You then check the scores of each sentiment and save the highest one in the prediction variable.

You then save that sentiment’s score to the score variable. This will make it easier to create human-readable output, which is the last line of this function.

You’ve now written the load_data() , train_model() , evaluate_model() , and test_model() functions. That means it’s time to put them all together and train your first model.

So far, you’ve built a number of independent functions that, taken together, will load data and train, evaluate, save, and test a sentiment analysis classifier in Python.

There’s one last step to make these functions usable, and that is to call them when the script is run. You’ll use the if __name__ == "__main__": idiom to accomplish this:
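A sketch of that entry point, tying the earlier functions together:

```python
if __name__ == "__main__":
    train, test = load_training_data(limit=2500)
    train_model(train, test)
    print("Testing model")
    test_model()
```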

Here you load your training data with the function you wrote in the Loading and Preprocessing Data section and limit the number of reviews used to 2500 total. You then train the model using the train_model() function you wrote in Training Your Classifier and, once that’s done, you call test_model() to test the performance of your model.

Note: With this number of training examples, training can take ten minutes or longer, depending on your system. You can reduce the training set size for a shorter training time, but you’ll risk having a less accurate model.

What did your model predict? Do you agree with the result? What happens if you increase or decrease the limit parameter when loading the data? Your scores and even your predictions may vary, but here’s what you should expect your output to look like:

As your model trains, you’ll see the measures of loss, precision, and recall and the F-score for each training iteration. You should see the loss generally decrease. The precision, recall, and F-score will all bounce around, but ideally they’ll increase. Then you’ll see the test review, sentiment prediction, and the score of that prediction—the higher the better.

You’ve now trained your first sentiment analysis machine learning model using natural language processing techniques and neural networks with spaCy! Here are two charts showing the model’s performance across twenty training iterations. The first chart shows how the loss changes over the course of training:

Loss over training iterations

While the above graph shows loss over time, the below chart plots the precision, recall, and F-score over the same training period:

The precision, recall, and f-score of the model over training iterations

In these charts, you can see that the loss starts high but drops very quickly over training iterations. The precision, recall, and F-score are pretty stable after the first few training iterations. What could you tinker with to improve these values?

Congratulations on building your first sentiment analysis model in Python! What did you think of this project? Not only did you build a useful tool for data analysis, but you also picked up on a lot of the fundamental concepts of natural language processing and machine learning.

In this tutorial, you learned how to:

  • Use natural language processing techniques
  • Use a machine learning classifier to determine the sentiment of processed text data
  • Build your own NLP pipeline with spaCy

You now have the basic toolkit to build more models to answer any research questions you might have. If you’d like to review what you’ve learned, then you can download and experiment with the code used in this tutorial at the link below:

What else could you do with this project? See below for some suggestions.

This is a core project that, depending on your interests, you can build a lot of functionality around. Here are a few ideas to get you started on extending this project:

  • The data-loading process loads every review into memory during load_data() . Can you make it more memory efficient by using generator functions instead?
  • Rewrite your code to remove stop words during preprocessing or data loading. How does the model performance change? Can you incorporate this preprocessing into a pipeline component instead?
  • Use a tool like Click to generate an interactive command-line interface .
  • Deploy your model to a cloud platform like AWS and wire an API to it. This can form the basis of a web-based tool.
  • Explore the configuration parameters for the textcat pipeline component and experiment with different configurations.
  • Explore different ways to pass in new reviews to generate predictions.
  • Parametrize options such as where to save and load trained models, whether to skip training or train a new model, and so on.

This project uses the Large Movie Review Dataset , which is maintained by Andrew Maas . Thanks to Andrew for making this curated dataset widely available for use.


IMDB movie review sentiment classification dataset

load_data function

Loads the IMDB dataset.

This is a dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a list of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".

As a convention, "0" does not stand for a specific word, but instead is used to encode the pad token.

  • path : where to cache the data (relative to ~/.keras/dataset ).
  • num_words : integer or None. Words are ranked by how often they occur (in the training set) and only the num_words most frequent words are kept. Any less frequent word will appear as oov_char value in the sequence data. If None, all words are kept. Defaults to None .
  • skip_top : skip the top N most frequently occurring words (which may not be informative). These words will appear as oov_char value in the dataset. When 0, no words are skipped. Defaults to 0 .
  • maxlen : int or None. Maximum sequence length. Any longer sequence will be truncated. None means no truncation. Defaults to None .
  • seed : int. Seed for reproducible data shuffling.
  • start_char : int. The start of a sequence will be marked with this character. 0 is usually the padding character. Defaults to 1 .
  • oov_char : int. The out-of-vocabulary character. Words that were cut out because of the num_words or skip_top limits will be replaced with this character.
  • index_from : int. Index actual words with this index and higher.
Returns: Tuple of Numpy arrays: (x_train, y_train), (x_test, y_test) .

x_train , x_test : lists of sequences, which are lists of indexes (integers). If the num_words argument was specified, the maximum possible index value is num_words - 1 . If the maxlen argument was specified, the largest possible sequence length is maxlen .

y_train , y_test : lists of integer labels (1 or 0).

Note : The 'out of vocabulary' character is only used for words that were present in the training set but are not included because they're not making the num_words cut here. Words that were not seen in the training set but are in the test set have simply been skipped.
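As a quick usage sketch (not part of the original docs):

```python
from tensorflow.keras.datasets import imdb

# Keep only the 10,000 most frequent words
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)

print(len(x_train), "training sequences")
print(x_train[0][:10])  # first review, as a list of word indexes
print(y_train[0])       # its label: 1 (positive) or 0 (negative)
```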

get_word_index function

Retrieves a dict mapping words to their index in the IMDB dataset.

The word index dictionary. Keys are word strings, values are their index.
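A minimal usage sketch:

```python
from tensorflow.keras.datasets import imdb

word_index = imdb.get_word_index()
print(word_index["great"])  # integer index of the word "great"
```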


Text Classification with Movie Reviews

This notebook classifies movie reviews as positive or negative using the text of the review. This is an example of binary, or two-class, classification, an important and widely applicable kind of machine learning problem.

We'll use the IMDB dataset that contains the text of 50,000 movie reviews from the Internet Movie Database . These are split into 25,000 reviews for training and 25,000 reviews for testing. The training and testing sets are balanced , meaning they contain an equal number of positive and negative reviews.

This notebook uses tf.keras , a high-level API to build and train models in TensorFlow, and TensorFlow Hub , a library and platform for transfer learning. For a more advanced text classification tutorial using tf.keras , see the MLCC Text Classification Guide .

More models

Here you can find more expressive or performant models that you could use to generate the text embedding.

Download the IMDB dataset

The IMDB dataset is available on TensorFlow datasets . The following code downloads the IMDB dataset to your machine (or the colab runtime):
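The download code itself is not shown here; a sketch using TensorFlow Datasets looks like this (variable names are assumptions):

```python
import tensorflow_datasets as tfds

# Load the full train and test splits as NumPy arrays
train_data, test_data = tfds.load(
    name="imdb_reviews",
    split=["train", "test"],
    batch_size=-1,
    as_supervised=True,
)
x_train, y_train = tfds.as_numpy(train_data)
x_test, y_test = tfds.as_numpy(test_data)
```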

Explore the data

Let's take a moment to understand the format of the data. Each example is a sentence representing the movie review and a corresponding label. The sentence is not preprocessed in any way. The label is an integer value of either 0 or 1, where 0 is a negative review, and 1 is a positive review.

Let's print the first 10 examples.

Let's also print the first 10 labels.
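A sketch, continuing from the arrays loaded above:

```python
print(x_train[:10])  # the first 10 reviews, as byte strings
print(y_train[:10])  # the first 10 labels: 0 = negative, 1 = positive
```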

Build the model

The neural network is created by stacking layers—this requires three main architectural decisions:

  • How to represent the text?
  • How many layers to use in the model?
  • How many hidden units to use for each layer?

In this example, the input data consists of sentences. The labels to predict are either 0 or 1.

One way to represent the text is to convert sentences into embeddings vectors. We can use a pre-trained text embedding as the first layer, which will have two advantages:

  • we don't have to worry about text preprocessing,
  • we can benefit from transfer learning.

For this example we will use a model from TensorFlow Hub called google/nnlm-en-dim50/2 .

There are two other models to test for the sake of this tutorial:

  • google/nnlm-en-dim50-with-normalization/2 - same as google/nnlm-en-dim50/2 , but with additional text normalization to remove punctuation. This can help to get better coverage of in-vocabulary embeddings for tokens on your input text.
  • google/nnlm-en-dim128-with-normalization/2 - A larger model with an embedding dimension of 128 instead of the smaller 50.

Let's first create a Keras layer that uses a TensorFlow Hub model to embed the sentences, and try it out on a couple of input examples. Note that the output shape of the produced embeddings is as expected: (num_examples, embedding_dimension) .
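A sketch, continuing from the x_train array loaded above:

```python
import tensorflow as tf
import tensorflow_hub as hub

embedding = "https://tfhub.dev/google/nnlm-en-dim50/2"
hub_layer = hub.KerasLayer(
    embedding, input_shape=[], dtype=tf.string, trainable=True
)
print(hub_layer(x_train[:3]))  # shape: (3, 50)
```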

Let's now build the full model:
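A sketch of the stacked model:

```python
model = tf.keras.Sequential([
    hub_layer,
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),  # single output node, emits logits
])
model.summary()
```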

The layers are stacked sequentially to build the classifier:

  • The first layer is a TensorFlow Hub layer. This layer uses a pre-trained Saved Model to map a sentence into its embedding vector. The model that we are using ( google/nnlm-en-dim50/2 ) splits the sentence into tokens, embeds each token and then combines the embedding. The resulting dimensions are: (num_examples, embedding_dimension) .
  • This fixed-length output vector is piped through a fully-connected ( Dense ) layer with 16 hidden units.
  • The last layer is densely connected with a single output node. This outputs logits: the log-odds of the true class, according to the model.

Hidden units

The above model has two intermediate or "hidden" layers, between the input and output. The number of outputs (units, nodes, or neurons) is the dimension of the representational space for the layer. In other words, the amount of freedom the network is allowed when learning an internal representation.

If a model has more hidden units (a higher-dimensional representation space), and/or more layers, then the network can learn more complex representations. However, it makes the network more computationally expensive and may lead to learning unwanted patterns—patterns that improve performance on training data but not on the test data. This is called overfitting , and we'll explore it later.

Loss function and optimizer

A model needs a loss function and an optimizer for training. Since this is a binary classification problem and the model outputs logits (a single-unit final layer with no activation), we'll use the binary_crossentropy loss function with from_logits=True.

This isn't the only choice for a loss function, you could, for instance, choose mean_squared_error . But, generally, binary_crossentropy is better for dealing with probabilities—it measures the "distance" between probability distributions, or in our case, between the ground-truth distribution and the predictions.

Later, when we are exploring regression problems (say, to predict the price of a house), we will see how to use another loss function called mean squared error.

Now, configure the model to use an optimizer and a loss function:
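A sketch of the compile step:

```python
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
```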

Create a validation set

When training, we want to check the accuracy of the model on data it hasn't seen before. Create a validation set by setting apart 10,000 examples from the original training data. (Why not use the testing set now? Our goal is to develop and tune our model using only the training data, then use the test data just once to evaluate our accuracy).
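A sketch of that split, using the variable names assumed earlier:

```python
x_val = x_train[:10000]
partial_x_train = x_train[10000:]

y_val = y_train[:10000]
partial_y_train = y_train[10000:]
```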

Train the model

Train the model for 40 epochs in mini-batches of 512 samples. This is 40 iterations over all samples in the x_train and y_train tensors. While training, monitor the model's loss and accuracy on the 10,000 samples from the validation set:
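A sketch of the training call:

```python
history = model.fit(
    partial_x_train,
    partial_y_train,
    epochs=40,
    batch_size=512,
    validation_data=(x_val, y_val),
    verbose=1,
)
```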

Evaluate the model

Let's see how the model performs. Two values will be returned: loss (a number representing our error; lower values are better) and accuracy.
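A sketch of the evaluation call:

```python
results = model.evaluate(x_test, y_test)
print(results)  # [loss, accuracy]
```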

This fairly naive approach achieves an accuracy of about 87%. With more advanced approaches, the model should get closer to 95%.

Create a graph of accuracy and loss over time

model.fit() returns a History object that contains a dictionary with everything that happened during training:
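A sketch:

```python
history_dict = history.history
print(history_dict.keys())
# dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])
```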

There are four entries: one for each monitored metric during training and validation. We can use these to plot the training and validation loss for comparison, as well as the training and validation accuracy:
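A plotting sketch, following the dots-for-training, solid-lines-for-validation convention described below:

```python
import matplotlib.pyplot as plt

acc = history_dict["accuracy"]
val_acc = history_dict["val_accuracy"]
loss = history_dict["loss"]
val_loss = history_dict["val_loss"]
epochs = range(1, len(acc) + 1)

plt.plot(epochs, loss, "bo", label="Training loss")       # dots
plt.plot(epochs, val_loss, "b", label="Validation loss")  # solid line
plt.title("Training and validation loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()
plt.show()

plt.plot(epochs, acc, "bo", label="Training accuracy")
plt.plot(epochs, val_acc, "b", label="Validation accuracy")
plt.title("Training and validation accuracy")
plt.xlabel("Epochs")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
```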

[Plot: training and validation loss and accuracy over epochs]

In this plot, the dots represent the training loss and accuracy, and the solid lines are the validation loss and accuracy.

Notice the training loss decreases with each epoch and the training accuracy increases with each epoch. This is expected when using a gradient descent optimization—it should minimize the desired quantity on every iteration.

This isn't the case for the validation loss and accuracy—they seem to peak after about twenty epochs. This is an example of overfitting: the model performs better on the training data than it does on data it has never seen before. After this point, the model over-optimizes and learns representations specific to the training data that do not generalize to test data.

For this particular case, we could prevent overfitting by simply stopping the training after twenty or so epochs. Later, you'll see how to do this automatically with a callback.


Natural Language Processing Lecture

8.4. CNN, LSTM and Attention for IMDB Movie Review Classification

Author: Johannes Maucher

Last Update: 23.11.2020

The IMDB Movie Review corpus is a standard dataset for the evaluation of text classifiers. It consists of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). In this notebook, convolutional neural networks (CNNs), an LSTM, and a bidirectional LSTM with attention are implemented for sentiment classification of IMDB reviews.

8.4.1. Access IMDB Dataset

The IMDB dataset is already available in Keras and can easily be accessed with imdb.load_data() .

The returned dataset contains the sequence of word indices for each review.

The representation of text as a sequence of integers is good for machine learning algorithms, but useless for human text understanding. Therefore, we also access the word index from the Keras IMDB dataset, which maps words to their associated integer IDs. Since we would like to map integer IDs back to words, we calculate the inverse word index inv_wordindex :
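A sketch (the notebook's exact code is not reproduced; note that Keras reserves indices 0-2, so data indices are offset by 3 relative to the raw word index):

```python
from tensorflow.keras.datasets import imdb

(x_train, y_train), (x_test, y_test) = imdb.load_data()

wordindex = imdb.get_word_index()
# Indices 0-2 are reserved (padding, start-of-sequence, out-of-vocabulary)
inv_wordindex = {value + 3: key for key, value in wordindex.items()}
inv_wordindex[0] = "<PAD>"
inv_wordindex[1] = "<START>"
inv_wordindex[2] = "<OOV>"

# Decode the first training review back into words
print(" ".join(inv_wordindex.get(i, "?") for i in x_train[0]))
```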

The first film-review of the training-partition then reads as follows:

Next the distribution of review-lengths (words per review) is calculated:

[Plot: distribution of review lengths]

8.4.2. Preparing Text Sequences and Labels

All sequences must be padded to a uniform length of MAX_SEQUENCE_LENGTH . This means that longer sequences are cut and shorter sequences are filled with zeros. For this, Keras provides the pad_sequences() function.

Moreover, all class-labels must be represented in one-hot-encoded form:
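A sketch of both steps (MAX_SEQUENCE_LENGTH is an assumed value; the notebook's constant may differ):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

MAX_SEQUENCE_LENGTH = 500  # assumed value

# Cut longer reviews and zero-pad shorter ones to a uniform length
x_train = pad_sequences(x_train, maxlen=MAX_SEQUENCE_LENGTH)
x_test = pad_sequences(x_test, maxlen=MAX_SEQUENCE_LENGTH)

# One-hot encode the binary class labels
y_train = to_categorical(y_train, num_classes=2)
y_test = to_categorical(y_test, num_classes=2)
```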

8.4.3. CNN with 2 Convolutional Layers

The first network architecture consists of:

  • an embedding layer, which takes sequences of integers and learns word-embeddings; the sequences of word-embeddings are then passed to the first convolutional layer
  • two 1D-convolutional layers with different numbers of filters and different filter sizes
  • two max-pooling layers to reduce the number of neurons required in the following layers
  • an MLP classifier with one hidden layer and the output layer
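A sketch of this architecture in Keras, under assumed constants (MAX_NUM_WORDS, EMBEDDING_DIM); the notebook's exact filter counts and sizes may differ:

```python
from tensorflow.keras.layers import (
    Conv1D, Dense, Embedding, Flatten, MaxPooling1D
)
from tensorflow.keras.models import Sequential

MAX_NUM_WORDS = 10000  # assumed vocabulary size
EMBEDDING_DIM = 100    # assumed embedding dimension

model = Sequential([
    Embedding(MAX_NUM_WORDS, EMBEDDING_DIM,
              input_length=MAX_SEQUENCE_LENGTH),
    Conv1D(128, 5, activation="relu"),
    MaxPooling1D(5),
    Conv1D(128, 5, activation="relu"),
    MaxPooling1D(5),
    Flatten(),
    Dense(64, activation="relu"),    # MLP hidden layer
    Dense(2, activation="softmax"),  # output layer for one-hot labels
])
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])
```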

8.4.3.1. Prepare Embedding Matrix and Embedding Layer

8.4.3.2. Define CNN Architecture

8.4.3.3. Train Network

[Plot: training and validation curves for the two-layer CNN]

As shown above, after 6 epochs of training the cross-entropy loss is 0.475 and the accuracy is 87.11%. However, the accuracy after 3 epochs was higher than the accuracy after 6 epochs, which indicates overfitting due to training for too long.

8.4.4. CNN with Different Filter Sizes in One Layer

In Y. Kim, Convolutional Neural Networks for Sentence Classification, a CNN with different filter sizes in one layer was proposed. This CNN is implemented below:

[Figure: the Kim CNN architecture. Source: Y. Kim; Convolutional Neural Networks for Sentence Classification]
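A sketch of the multi-filter-size idea using the Keras functional API (filter sizes 3, 4, 5 are assumptions; the notebook's exact configuration may differ):

```python
from tensorflow.keras.layers import (
    Concatenate, Conv1D, Dense, Embedding, GlobalMaxPooling1D, Input
)
from tensorflow.keras.models import Model

inputs = Input(shape=(MAX_SEQUENCE_LENGTH,))
x = Embedding(MAX_NUM_WORDS, EMBEDDING_DIM)(inputs)

# One convolutional branch per filter size, as in Kim's architecture
branches = []
for filter_size in [3, 4, 5]:
    branch = Conv1D(128, filter_size, activation="relu")(x)
    branch = GlobalMaxPooling1D()(branch)
    branches.append(branch)

merged = Concatenate()(branches)
outputs = Dense(2, activation="softmax")(merged)

model = Model(inputs, outputs)
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])
```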

8.4.4.1. Prepare Embedding Matrix and Embedding Layer

8.4.4.2. Define Architecture

8.4.4.3. Train Network

[Plot: training and validation curves for the Kim CNN]

As shown above, after 8 epochs of training the cross-entropy-loss is 0.467 and the accuracy is 88.47%.

8.4.5. LSTM
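The notebook's LSTM code is not reproduced here; a minimal sketch of such a classifier, reusing the assumed constants from above, might look like this:

```python
from tensorflow.keras.layers import Dense, Embedding, LSTM
from tensorflow.keras.models import Sequential

model = Sequential([
    Embedding(MAX_NUM_WORDS, EMBEDDING_DIM,
              input_length=MAX_SEQUENCE_LENGTH),
    LSTM(64),
    Dense(2, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])
```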

[Plot: training and validation curves for the LSTM]

As shown above, after 6 epochs of training the cross-entropy loss is 0.467 and the accuracy is 86.7%. However, the accuracy after 2 epochs was higher than the accuracy after 6 epochs, which indicates overfitting due to training for too long.

8.4.6. Bidirectional LSTM Architecture with Attention

8.4.6.1. Define Custom Attention Layer

Since Keras does not provide an attention-layer, we have to implement this type on our own. The implementation below corresponds to the attention-concept as introduced in Bahdanau et al: Neural Machine Translation by Jointly Learning to Align and Translate .

The general concept of writing custom Keras layers is described in the corresponding Keras documentation .

Any custom layer class inherits from the Layer class and must implement three methods:

  • build(input_shape) : this is where you define your weights. This method must set self.built = True , which can be done by calling super().build(input_shape) .
  • call(x) : this is where the layer’s logic lives. Unless you want your layer to support masking, you only have to care about the first argument passed to call: the input tensor.
  • compute_output_shape(input_shape) : in case your layer modifies the shape of its input, you should specify the shape transformation logic here. This allows Keras to do automatic shape inference.
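A common implementation of this interface for additive (Bahdanau-style) attention over LSTM timesteps follows; this is a sketch, not the notebook's exact code:

```python
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Layer

class Attention(Layer):
    """Additive (Bahdanau-style) attention over the timestep axis."""

    def build(self, input_shape):
        # input_shape: (batch, timesteps, features)
        self.W = self.add_weight(name="att_weight",
                                 shape=(input_shape[-1], 1),
                                 initializer="random_normal")
        self.b = self.add_weight(name="att_bias",
                                 shape=(input_shape[1], 1),
                                 initializer="zeros")
        super().build(input_shape)  # sets self.built = True

    def call(self, x):
        # Score each timestep, normalize the scores with softmax,
        # then return the attention-weighted sum of the timesteps
        e = K.tanh(K.dot(x, self.W) + self.b)
        a = K.softmax(e, axis=1)
        return K.sum(x * a, axis=1)

    def compute_output_shape(self, input_shape):
        # (batch, timesteps, features) -> (batch, features)
        return (input_shape[0], input_shape[-1])
```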

[Plot: training and validation curves for the bidirectional LSTM with attention]

Again, the achieved accuracy is in the same range as for the other architectures. None of the architectures has been optimized, e.g. through hyperparameter-tuning. However, the goal of this notebook is not the determination of an optimal model, but the demonstration of how modern neural network architectures can be implemented for text-classification.

Sentiment Classification on the Large Movie Review Dataset

Data Mining Project: BERT Sentiment Classification

  • Monticone Pietro
  • Moroni Claudio
  • Orsenigo Davide

Problem: Sentiment Classification

A sentiment classification problem consists, roughly speaking, in taking a piece of text and predicting whether the author likes or dislikes what he/she is talking about: the input X is a piece of text and the output Y is the sentiment we want to predict, such as the rating of a movie review.

If we can train a model to map X to Y on a labelled dataset, then it can be used to predict the sentiment of a reviewer after watching a movie.

Data: Large Movie Review Dataset v1.0

The dataset contains movie reviews along with their associated binary sentiment polarity labels.

  • The core dataset contains 50,000 reviews split evenly into 25k train and 25k test sets.
  • The overall distribution of labels is balanced (25k pos and 25k neg).
  • 50,000 unlabeled documents for unsupervised learning are included, but they won’t be used.
  • The train and test sets contain a disjoint set of movies, so no significant performance is obtained by memorizing movie-unique terms and their association with observed labels.
  • In the labeled train/test sets, a negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. Thus reviews with more neutral ratings are not included in the train/test sets.
  • In the unsupervised set, reviews of any rating are included and there are an even number of reviews > 5 and ≤ 5.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis . The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).

Theoretical introduction

The Encoder-Decoder Sequence

Roughly speaking, an encoder-decoder sequence is an ordered collection of steps ( coders ) designed to automatically translate sentences from one language to another (e.g. the English “the pen is on the table” into the Italian “la penna è sul tavolo”). It can be visualized as follows: input sentence → ( encoders ) → ( decoders ) → output/translated sentence .

For our practical purposes, encoders and decoders are effectively indistinguishable (that’s why we will call them coders ): both are composed of two layers, an LSTM or GRU neural network and an attention module (AM) . They only differ in the way in which their output is processed.

LSTM or GRU neural network

Both the input and the output of an LSTM/GRU neural network consist of two vectors:

  • the hidden state : the representation of what the network has learnt about the sentence it’s reading;
  • the prediction : the representation of what the network predicts (e.g. translation).

Each word in the English input sentence is translated into its word embedding vector (WEV) before being processed by the first coder (e.g. with word2vec ). The WEV of the first word of the sentence and a random hidden state are processed by the first coder of the sequence. Regarding the output: the prediction is ignored, while the hidden state and the WEV of the second word are passed as input into the second coder and so on to the last word of the sentence. Therefore in this phase the coders work as encoders .

At the end of the sequence of N encoders (N being the number of words in the input sentence), the decoding phase begins:

  • the last hidden state and the WEV of the “START” token are passed to the first decoder ;
  • the decoder outputs a hidden state and a prediction;
  • the hidden state and the prediction are passed to the second decoder;
  • the second decoder outputs a new hidden state and the second word of the translated/output sentence

and so on until the whole sentence has been translated, namely when a decoder of the sequence outputs the WEV of the “END” token. An external mechanism then converts prediction vectors into real words, so it is important to note that the only purpose of decoders is to predict the next word .

Attention module (AM)

The attention module is a further layer, placed before the network, which provides the collection of words of the sentence with a relational structure. Consider the word “table” in the sentence used as an example above. Because of the AM, the encoder will weight the preposition “on” (processed by the previous encoder) more heavily than the article “the” that refers to the subject “pen”.
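A minimal sketch of what the AM computes, assuming simple dot-product attention over the stack of encoder hidden states (the function and argument names are illustrative):

    def attention(query, enc_states):
        # query: (1, hidden_dim) current state; enc_states: (N, hidden_dim)
        scores = tf.matmul(query, enc_states, transpose_b=True)  # (1, N) similarities
        weights = tf.nn.softmax(scores, axis=-1)                 # e.g. "on" outweighs "the"
        context = tf.matmul(weights, enc_states)                 # weighted summary, (1, hidden_dim)
        return context, weights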

Bidirectional Encoder Representations from Transformers (BERT)

Transformer

The transformer is a coder endowed with the AM layer. Transformers have been observed to work much better than the basic encoder-decoder sequences.

BERT is a sequence of encoder-type transformers pre-trained to predict a word or sentence (i.e. used as a decoder). The improved performance of transformers comes at a cost: the loss of bidirectionality , the ability to predict both the next word and the previous one. BERT is the solution to this problem: a transformer which preserves bidirectionality .

In BERT the first input token is not “START” but “[CLS]” (classification). To use BERT as a pre-trained language model for sentence classification, we feed BERT's prediction for “[CLS]” into a logistic-regression classifier (a sketch follows the list) because

  • the model has been trained to predict the next sentence, not just the next word;
  • the semantic information of the sentence is encoded in the prediction output of “[CLS]” as a document vector of 512 elements.
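The “Visual Guide to Using BERT for the First Time” listed below implements exactly this recipe; here is a condensed sketch of the idea using the Hugging Face transformers package and scikit-learn (neither appears in the original post, and the two training reviews are toy data):

    import torch
    from sklearn.linear_model import LogisticRegression
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    bert = BertModel.from_pretrained("bert-base-uncased")

    def cls_vector(text):
        # BERT's output vector for the [CLS] token (768 values for BERT-base)
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            outputs = bert(**inputs)
        return outputs.last_hidden_state[0, 0].numpy()

    # The [CLS] vectors become the features of a logistic regression.
    X = [cls_vector(t) for t in ["A wonderful, moving film.", "A dull, pointless mess."]]
    clf = LogisticRegression().fit(X, [1, 0])  # 1 = positive, 0 = negative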

Resources

  • bert_final_data
  • https://www.kaggle.com/dataset/5f1193b4685a6e3aa8b72fa3fdc427d18c3568c66734d60cf8f79f2607551a38
  • https://www.kaggle.com/dataset/9850d2e4b7d095e2b723457263fbef547437b159e3eb7ed6dc2e88c7869fca0b
  • Bert-For-Tf2
  • Google github repository
  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  • A Visual Guide to Using BERT for the First Time
  • Machine Translation (Encoder-Decoder Model)
  • The Illustrated Transformer
  • The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)
  • BERT Explained: State of the art language model for NLP
  • Learning Word Vectors for Sentiment Analysis

Saroswat/Text-Classification-using-RNN-using-IMDB-Dataset


Text Classification using RNN on the IMDB Dataset

This text classification tutorial demonstrates the implementation of a Recurrent Neural Network (RNN) on the IMDB large movie review dataset for sentiment analysis. The dataset comprises movie reviews labeled as either positive or negative sentiment.

The code showcases:

  • Setup and initialization using TensorFlow and TensorFlow Datasets (TFDS).
  • Preprocessing of the IMDB dataset for binary sentiment classification.
  • Building an RNN-based model using TensorFlow/Keras for sentiment analysis.
  • Model training, evaluation, and visualization of training metrics.

This code requires TensorFlow and TensorFlow Datasets. Use the provided setup to install the necessary packages.

Input Pipeline

The dataset is split into training and test sets and processed using TensorFlow Datasets. The code demonstrates (a sketch follows the list):

  • Dataset loading with tfds.load .
  • Shuffle and batch setup for training and test datasets.
  • Visualization of text and label pairs.
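A minimal sketch of that pipeline, following the TFDS/TensorFlow calls named above (buffer and batch sizes are illustrative choices):

    # pip install tensorflow tensorflow-datasets
    import tensorflow as tf
    import tensorflow_datasets as tfds

    dataset, info = tfds.load('imdb_reviews', with_info=True, as_supervised=True)
    train_dataset, test_dataset = dataset['train'], dataset['test']

    BUFFER_SIZE, BATCH_SIZE = 10_000, 64
    train_dataset = (train_dataset.shuffle(BUFFER_SIZE)
                     .batch(BATCH_SIZE)
                     .prefetch(tf.data.AUTOTUNE))
    test_dataset = test_dataset.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

    for example, label in train_dataset.take(1):
        print(example[0].numpy()[:80], label[0].numpy())  # one text/label pair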

Text Encoding

The raw text from the dataset is preprocessed using the TextVectorization layer. This layer adapts to the text and encodes it into indices for model input. The process involves setting vocabulary size, encoding text to indices, and reversing the encoding.
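Continuing the sketch above, the encoding and its reversal look roughly like this (VOCAB_SIZE is an illustrative choice):

    VOCAB_SIZE = 1000
    encoder = tf.keras.layers.TextVectorization(max_tokens=VOCAB_SIZE)
    encoder.adapt(train_dataset.map(lambda text, label: text))

    vocab = encoder.get_vocabulary()                        # index -> token
    encoded = encoder(example).numpy()                      # batch of texts -> index matrix
    decoded = " ".join(vocab[i] for i in encoded[0] if i)   # reverse the encoding, skipping padding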

Model Architecture

The model architecture consists of the following layers (sketched after the list):

  • TextVectorization layer for encoding text.
  • Embedding layer for word representation.
  • Bidirectional LSTM layer for sequence processing.
  • Dense layers for final classification.
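A Keras sketch of that stack (layer sizes are illustrative; the final Dense(1) outputs a logit, positive for positive sentiment):

    model = tf.keras.Sequential([
        encoder,                                        # text -> indices
        tf.keras.layers.Embedding(
            input_dim=len(encoder.get_vocabulary()),
            output_dim=64,
            mask_zero=True),                            # word representation
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(1)
    ])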

Training and Evaluation

The code compiles and trains the model using a binary cross-entropy loss function and Adam optimizer. It tracks training accuracy, loss, and evaluates model performance on the test set.
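Roughly, under the choices named above (the epoch count is illustrative):

    model.compile(
        loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
        optimizer=tf.keras.optimizers.Adam(1e-4),
        metrics=['accuracy'])

    history = model.fit(train_dataset, epochs=10,
                        validation_data=test_dataset)
    test_loss, test_acc = model.evaluate(test_dataset)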

Additional Techniques

The tutorial also demonstrates stacking multiple LSTM layers in the model architecture for improved performance, and visualizes training metrics using Matplotlib.
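A sketch of the stacked variant: every LSTM except the last returns its full output sequence (return_sequences=True) so the next LSTM receives one vector per time step (the Dropout layer is an illustrative addition):

    stacked = tf.keras.Sequential([
        encoder,
        tf.keras.layers.Embedding(len(encoder.get_vocabulary()), 64, mask_zero=True),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(1)
    ])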

  • Setup : Install required packages.
  • Execution : Run code blocks sequentially to observe the training process and model evaluation.
  • Model Customization : Explore changing the model architecture or hyperparameters for different results.
  • Visualizations : Analyze training and validation metric plots to understand model performance.

Sample Predictions

The code includes examples of predicting sentiment for custom input sentences using the trained model.
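For example (the review text is made up):

    import numpy as np

    sample_text = ('The movie was cool. The animation and the graphics '
                   'were out of this world. I would recommend this movie.')
    prediction = model.predict(np.array([sample_text]))
    print('positive' if prediction[0][0] > 0 else 'negative')  # logit > 0 -> positive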





    Movie Review Dataset. Movie Review Dataset. code. New Notebook. table_chart. New Dataset. tenancy. New Model. emoji_events. New Competition. corporate_fare. New Organization. No Active Events. Create notebooks and keep track of their status here. add New Notebook. auto_awesome_motion. 0 Active Events. expand_more. menu. Skip to