Data Science Mini Project: NLP on Reviews – Cleanup

This project is your first step toward deriving insights from the gold mine of data that is text, and your first foray into NLP. Learn how to process raw text and make word clouds. This is part 2 of the series.

Project Features

Good for Intermediate Level
100 points for Enrolling in Project
200 points for Submitting Solution

5/5 (1) 29+ Enrolled Learners
2 Lessons

Project Problem Statement

In the first part, you loaded the data and made a word cloud from the raw text, and the output wasn't really insightful. The text needs a lot of cleaning up! In this microproject, you'll prepare the data for cleanup and further processing.

You need to break the sentences into individual words and create one big, consolidated list of words. This step is called 'tokenizing'. Once you have the list, further processing gets easy, as you'll see in the next microproject. Figure out how you can break a sentence into individual words (there are several ways!). It's OK if you don't get it on the first try.

Tasks:
  1. Break each sentence into constituent words (each sentence should become a list of terms)
  2. Combine individual lists to form a single big list of all the terms
Submit your solution as a .py file or a Jupyter notebook. Make sure to provide your insights as comments or markdown in the code.
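To make "several ways" concrete, here is a minimal sketch of two ways to tokenize using only the Python standard library. The `reviews` list is made-up sample data (the real project uses the reviews loaded in part 1), and the regular expression is just one reasonable choice, not the expected solution.

```python
import re

# Hypothetical sample reviews; the real project uses the data loaded in part 1.
reviews = [
    "Great phone, works well.",
    "Battery died fast!",
]

# Way 1: split on whitespace. Simple, but punctuation stays glued to the words.
split_tokens = [review.split() for review in reviews]

# Way 2: a regular expression that keeps runs of letters, digits and apostrophes,
# which drops punctuation such as commas and full stops.
regex_tokens = [re.findall(r"[A-Za-z0-9']+", review) for review in reviews]

print(split_tokens[0])  # ['Great', 'phone,', 'works', 'well.']
print(regex_tokens[0])  # ['Great', 'phone', 'works', 'well']
```

Either way, each sentence becomes its own list of terms, which is what task 1 asks for. Which variant is better depends on whether you want punctuation to survive into later steps.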


Hints:
  1. NLTK has a word_tokenize function that breaks each sentence into individual words
  2. Use a loop to combine the individual lists of words into one
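The two hints above can be sketched roughly as follows. This is an illustration under assumptions, not the official solution: the `reviews` list is made-up sample data, the `tokenize` helper is a hypothetical wrapper, and the regex fallback is only there so the snippet still runs when NLTK or its tokenizer data is not installed.

```python
import re


def tokenize(sentence):
    """Hypothetical helper: use NLTK's word_tokenize when it is available."""
    try:
        from nltk.tokenize import word_tokenize
        return word_tokenize(sentence)
    except (ImportError, LookupError):
        # NLTK is not installed, or its tokenizer data has not been downloaded
        # (for example with nltk.download("punkt")). Fall back to a simple regex
        # that separates words from punctuation marks. This is an assumption
        # made for the sketch and is not equivalent to NLTK in general.
        return re.findall(r"\w+|[^\w\s]", sentence)


# Hypothetical sample reviews; the real project uses the data loaded in part 1.
reviews = [
    "Great phone, works well.",
    "Battery died fast!",
]

# Loop over the sentences and extend one consolidated list with each token list.
all_terms = []
for review in reviews:
    all_terms.extend(tokenize(review))

print(all_terms)
```

Note that `extend` adds the individual tokens to `all_terms`, whereas `append` would add each token list as a single nested element.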
We would like you to try it on your own first. You will get the project solution one week after enrolling. All the best!