
Feature Engineering
Feature engineering is the process of deriving new features from existing ones. These new features are useful because they tend to explain the variability in the data effectively. One application of feature engineering is calculating how similar different pieces of text are. There are various ways of calculating the similarity between two texts; the two most popular methods are cosine similarity and Jaccard similarity. Let's learn about each of them:
- Cosine similarity: The cosine similarity between two texts is the cosine of the angle between their vector representations. BoW and TF-IDF matrices can be regarded as vector representations of texts. (A short sketch of this calculation follows this list.)
- Jaccard similarity: This is the ratio of the number of terms common between two text documents to the total number of unique terms present in those texts.
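As a quick illustration of the cosine formula, here is a minimal sketch using two toy term-count vectors (the numbers are invented purely for illustration):
import numpy as np

# Toy term-count vectors for two short texts (invented numbers)
v1 = np.array([1, 1, 1, 0, 0])
v2 = np.array([0, 1, 1, 1, 1])

# Cosine similarity = dot product / (product of the vector norms)
cos_sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(cos_sim)  # 0.577...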
Let's understand Jaccard similarity with the help of an example. Suppose there are two texts:
Text 1: I like detective Byomkesh Bakshi.
Text 2: Byomkesh Bakshi is not a detective, he is a truth seeker.
The common terms are "Byomkesh," "Bakshi," and "detective."
The number of common terms in the texts is three.
The unique terms present across both texts are "I," "like," "detective," "Byomkesh," "Bakshi," "is," "not," "a," "he," "truth," and "seeker."
The number of unique terms is eleven.
Therefore, the Jaccard similarity is 3/11 ≈ 0.27.
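A few lines of Python verify these counts (terms lowercased, punctuation dropped):
tokens1 = "i like detective byomkesh bakshi".split()
tokens2 = "byomkesh bakshi is not a detective he is a truth seeker".split()

common = set(tokens1) & set(tokens2)   # {'byomkesh', 'bakshi', 'detective'}
unique = set(tokens1) | set(tokens2)   # 11 unique terms in total
print(len(common) / len(unique))       # 3 / 11 = 0.2727...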
To get a better understanding of text similarity, we will solve an exercise in the next section.
Exercise 26: Feature Engineering (Text Similarity)
In this exercise, we will calculate the Jaccard and cosine similarity for a given pair of texts. Follow these steps to implement this exercise:
- Open a Jupyter notebook.
- Insert a new cell and add the following code to import the necessary packages:
import nltk
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Download the tokenizer and lemmatizer data (needed on the first run only)
nltk.download('punkt')
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
- Now we declare the pair1, pair2, and pair3 variables, as follows:
pair1 = ["What you do defines you","Your deeds define you"]
pair2 = ["Once upon a time there lived a king.", "Who is your queen?"]
pair3 = ["He is desperate", "Is he not desperate?"]
- We will create a function to extract the Jaccard similarity between a pair of sentences. Add the following code to do this:
def extract_text_similarity_jaccard(text1, text2):
    # Lowercase and lemmatize the tokens of both texts
    words_text1 = [lemmatizer.lemmatize(word.lower()) for word in word_tokenize(text1)]
    words_text2 = [lemmatizer.lemmatize(word.lower()) for word in word_tokenize(text2)]
    # Numerator: number of common terms; denominator: number of unique terms
    nr = len(set(words_text1).intersection(set(words_text2)))
    dr = len(set(words_text1).union(set(words_text2)))
    jaccard_sim = nr / dr
    return jaccard_sim
- To check the Jaccard similarity between statements in pair1, write the following code:
extract_text_similarity_jaccard(pair1[0],pair1[1])
The code generates the following output:
Figure 2.44: Jaccard similarity coefficient
- To check the Jaccard similarity between statements in pair2, write the following code:
extract_text_similarity_jaccard(pair2[0],pair2[1])
The code generates the following output:
Figure 2.45: Jaccard similarity coefficient
- To check the Jaccard similarity between statements in pair3, write the following code:
extract_text_similarity_jaccard(pair3[0],pair3[1])
The code generates the following output:
Figure 2.46: Jaccard similarity coefficient
- To compute the cosine similarity, we first create a TF-IDF model and a corpus containing the texts of pair1, pair2, and pair3. Add the following code to do this:
tfidf_model = TfidfVectorizer()
corpus = [pair1[0], pair1[1], pair2[0], pair2[1], pair3[0], pair3[1]]
- Now we store the TF-IDF representations of the texts of pair1, pair2, and pair3 in a tfidf_results variable, calling toarray() to convert the sparse result into a regular NumPy array. Add the following code to do this:
tfidf_results = tfidf_model.fit_transform(corpus).toarray()
- To check the cosine similarity between the first two texts, we slice out their rows (slicing keeps the arrays two-dimensional, which cosine_similarity expects) and write the following code:
cosine_similarity(tfidf_results[0:1], tfidf_results[1:2])
The code generates the following output:
Figure 2.47: Cosine similarity
- To check the cosine similarity between the third and fourth texts, we write the following code:
cosine_similarity(tfidf_results[2:3], tfidf_results[3:4])
The code generates the following output:
Figure 2.48: Cosine similarity
- To check the cosine similarity between the fifth and sixth texts, we write the following code:
cosine_similarity(tfidf_results[4:5], tfidf_results[5:6])
The code generates the following output:

Figure 2.49: Cosine similarity
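As a follow-up, note that cosine_similarity also accepts the whole TF-IDF matrix and returns all pairwise similarities in one call:
# Pairwise cosine similarities of all six texts as a symmetric matrix
pairwise = cosine_similarity(tfidf_results)
print(pairwise.shape)   # (6, 6): one row and one column per text in the corpus
print(pairwise[0, 1])   # the same value as the pair1 similarity computed above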
Word Clouds
Unlike numeric data, text data offers very few options for visual representation. The most popular way of visualizing text data is with a word cloud: a visualization of a text corpus in which the size of each token (word) reflects the number of times it occurs. Let's go through an exercise to understand this better.
Exercise 27: Word Clouds
In this exercise, we will visualize the first 10 articles from sklearn's fetch_20newsgroups text dataset using a word cloud. Follow these steps to implement this exercise:
- Open a Jupyter notebook.
- Import the necessary libraries and dataset. Add the following code to do this:
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_20newsgroups
newsgroups_data_sample = fetch_20newsgroups(subset='train')
- To check that the data has been fetched, type the following code:
newsgroups_data_sample['data'][:10]
The code generates the following output:
Figure 2.50: Sample from the sklearn dataset
- Now add the following lines of code to create a word cloud. The WordCloud class and the built-in STOPWORDS list come from the wordcloud package:
from wordcloud import WordCloud, STOPWORDS

other_stopwords_to_remove = ['\\n', 'n', '\\', '>', 'nLines', 'nI', "n'"]
stopwords = STOPWORDS.union(set(other_stopwords_to_remove))
text = str(newsgroups_data_sample['data'][:10])
wordcloud = WordCloud(width=800, height=800,
                      background_color='white',
                      max_words=200,
                      stopwords=stopwords,
                      min_font_size=10).generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
The code generates the following output:

Figure 2.51: Word cloud representation of the first 10 articles
In the next section, we will explore other visualizations, such as dependency parse trees and named entities.
Other Visualizations
Apart from word clouds, there are various other ways of visualizing texts. Some of the most popular ways are listed here:
- Visualizing sentences using a dependency parse tree: Generally, the words and phrases that constitute a sentence depend on one another. We can depict these dependencies using a tree structure known as a dependency parse tree. For instance, in the sentence "God helps those who help themselves," the verb "helps" has two dependents: "God" (the one who helps) and "those" (the ones who are helped).
- Visualizing named entities in a text corpus: In this case, we extract the named entities from texts and highlight them by using different colors.
Let's go through the following exercise to understand this better.
Exercise 28: Other Visualizations (Dependency Parse Trees and Named Entities)
In this exercise, we will look at other visualization methods, namely dependency parse trees and named entity highlighting. Follow these steps to implement this exercise:
- Open a Jupyter notebook.
- Insert a new cell and add the following code to import the necessary libraries (this assumes the en_core_web_sm model has already been installed with python -m spacy download en_core_web_sm):
import spacy
from spacy import displacy
import en_core_web_sm
nlp = en_core_web_sm.load()
- Now we'll depict the sentence "God helps those who help themselves" using a dependency parse tree. Add the following code to implement this:
doc = nlp('God helps those who help themselves.')
displacy.render(doc, style='dep', jupyter=True)
The code generates the following output:
Figure 2.52: Dependency parse tree
- Now we will visualize the named entities of the text corpus. Add the following code to implement this:
text = 'Once upon a time there lived a saint named Ramakrishna Paramahansa. \
His chief disciple Narendranath Dutta also known as Swami Vivekananda \
is the founder of Ramakrishna Mission and Ramakrishna Math.'
doc2 = nlp(text)
displacy.render(doc2, style='ent', jupyter=True)
The code generates the following output:

Figure 2.53: Named entities
Now that you have learned about visualizations, in the next section, we will solve an activity based on them to gain an even better understanding.
Activity 4: Text Visualization
In this activity, we will create a word cloud for the 50 most frequent words in a dataset. The dataset we will use consists of random sentences that are not clean. First, we need to clean them and create a unique set of frequently occurring words.
Note
The text_corpus.txt dataset used in this activity can be found at this location: https://bit.ly/2HQ2luS.
Follow these steps to implement this activity:
- Import the necessary libraries.
- Fetch the dataset.
- Perform the pre-processing steps, such as text cleaning, tokenization, stop-word removal, lemmatization, and stemming, on the fetched data.
- Create a set of unique words along with their frequencies for the 50 most frequently occurring words.
- Create a word cloud for these top 50 words.
- Validate the word cloud by comparing it with the calculated word frequencies. (One possible sketch follows this list.)
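If you want to sanity-check your approach before turning to the solution, here is one possible sketch. It assumes the dataset has been saved locally as text_corpus.txt and uses lemmatization only; you can add a stemming step in the same loop:
import nltk
import matplotlib.pyplot as plt
from collections import Counter
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from wordcloud import WordCloud

# Download the required NLTK resources (first run only)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Assumes the dataset has been downloaded as text_corpus.txt
with open('text_corpus.txt') as f:
    raw_text = f.read()

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

# Clean and tokenize: keep alphabetic tokens, drop stop words, lemmatize
tokens = [lemmatizer.lemmatize(tok.lower())
          for tok in word_tokenize(raw_text)
          if tok.isalpha() and tok.lower() not in stop_words]

# Frequencies of the 50 most common words
top_50 = dict(Counter(tokens).most_common(50))
print(top_50)  # compare these counts against the word cloud

wc = WordCloud(width=800, height=800, background_color='white')
wordcloud = wc.generate_from_frequencies(top_50)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()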
Note
The solution for this activity can be found on page 266.