Mini-Project: Digital NLP | Fresco Play Hands-on Solution | HackerRank Solution

 


🚀 Explore the capabilities of NLTK (Natural Language Toolkit) to supercharge your text processing and Natural Language Processing (NLP) tasks. This guide takes you from basic tokenization to sentiment analysis and ends with applying the K-Nearest Neighbors (KNN) algorithm for intelligent text classification.


LAB 1: Introduction to NLP Text Processing | MLT Sprint 5 - Case Study 1


🧠 Task 1: NLP - Python - Processing Raw Text | MLT Sprint 5 – LAB 1

Are you ready to dive into the fundamentals of Natural Language Processing (NLP) using Python? In this case study from Sprint 5, we’ll explore how to process raw text data from the web, tokenize it, and extract meaningful insights using Python’s popular NLTK library.

This is a perfect hands-on mini-project for anyone who’s starting with Machine Learning and Text Analytics.


📌 Objective:

Build a function called processRawText that:

  • Fetches and processes raw text from a URL.
  • Tokenizes the text.
  • Counts total and unique words.
  • Computes word coverage and frequency distribution.


Sample Output
210
127
1
The


#!/bin/python3
 
import math
import os
import random
import re
import sys
import zipfile
os.environ['NLTK_DATA'] = os.getcwd() + "/nltk_data"
import nltk
 
#
# Complete the 'processRawText' function below.
#
# The function accepts STRING textURL as parameter.
#
 
def processRawText(textURL):
    import requests
    from collections import Counter

    # Fetch the raw text from the URL
    response = requests.get(textURL)

    # Tokenize into words and normalize to lowercase
    words = nltk.tokenize.word_tokenize(response.text)
    lower_words = list(map(str.lower, words))
    noofwords = len(lower_words)

    # Count the distinct words
    unique_words = set(lower_words)
    noofunqwords = len(unique_words)

    # Word coverage: average occurrences per distinct word, rounded down
    wordcov = math.floor(noofwords / noofunqwords)

    # Most frequent word in the text
    maxfreq = Counter(lower_words).most_common(1)[0][0]

    return noofwords, noofunqwords, wordcov, maxfreq
 
if __name__ == '__main__':
    textURL = input()
 
    if not os.path.exists(os.getcwd() + "/nltk_data"):
        with zipfile.ZipFile("nltk_data.zip", 'r') as zip_ref:
            zip_ref.extractall(os.getcwd())
 
    noofwords, noofunqwords, wordcov, maxfreq = processRawText(textURL)
    print(noofwords)
    print(noofunqwords)
    print(wordcov)
    print(maxfreq)
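
For intuition, here is a minimal sketch of what NLTK's word tokenizer produces on a short string (the sentence is made up for illustration; it assumes the punkt tokenizer data is available):

import nltk
from collections import Counter

nltk.download('punkt')  # tokenizer data; newer NLTK versions may ask for 'punkt_tab' instead

text = "The quick brown fox jumps over the lazy dog. The dog sleeps."
tokens = [t.lower() for t in nltk.tokenize.word_tokenize(text)]

print(len(tokens))                           # 14 tokens, punctuation included
print(len(set(tokens)))                      # 10 unique tokens
print(Counter(tokens).most_common(1)[0][0])  # 'the' (appears three times)

Note that punctuation marks count as tokens, which is why the totals in the sample output include more than just words.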


LAB 2: Case Study 2 - NLP - Text Representation

Fresco Course ID: 2632

Solution: Run the cell below to install the required library:

pip install nltk



import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

df = pd.read_csv('dataset.csv')



Task 1: Use CountVectorizer to find the vocabulary for the given dataset and store it in the variable S1. Note: the output must be a DataFrame and its column name should be 'order'.


vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df.iloc[:, 0])  # Fit the vectorizer to the text data
S1 = pd.DataFrame({'order': list(vectorizer.get_feature_names())})
# S1 = pd.DataFrame(sorted(vectorizer.vocabulary_.keys()), columns=['order'])  # Alternative way
# Note: on scikit-learn >= 1.0, use get_feature_names_out() instead of get_feature_names()
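
As a quick sanity check, here is a toy sketch (with made-up sentences, not the course dataset) showing that the vocabulary is simply the sorted set of unique tokens across all documents:

from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["the cat sat", "the dog sat on the mat"]  # hypothetical mini-corpus
cv = CountVectorizer()
cv.fit(toy_docs)
print(sorted(cv.vocabulary_.keys()))
# ['cat', 'dog', 'mat', 'on', 'sat', 'the']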


Task 2: Find the Bag of Words for the given dataset and store it in the variable S2. Note: the output must be a DataFrame and its column names should be the feature words (get_feature_names).


# Find Bag of Words representation and store in S2
S2 = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
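
To see what the Bag of Words matrix looks like, here is a small sketch on the same hypothetical mini-corpus: each row is a document and each column counts one vocabulary word:

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
bow = cv.fit_transform(["the cat sat", "the dog sat on the mat"]).toarray()
print(bow)
# Columns (sorted vocabulary): cat, dog, mat, on, sat, the
# [[1 0 0 0 1 1]
#  [0 1 1 1 1 2]]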




Task 3: Find the Term Frequency (TF) with norm 'l1' and use_idf disabled for the given dataset, and store it in the variable S3. Note: the output must be a DataFrame and its column names should be the feature words (get_feature_names).


# Initialize TfidfVectorizer with L1 normalization and use_idf=False
vectorizer = TfidfVectorizer(norm='l1', use_idf=False)
# Fit and transform the text data
X = vectorizer.fit_transform(df['reviews'])
# Convert to DataFrame
S3 = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())



Task 4: Find the Term Frequency (TF) with norm 'l2' and use_idf disabled for the given dataset, and store it in the variable S4. Note: the output must be a DataFrame and its column names should be the feature words (get_feature_names).


# Initialize TfidfVectorizer with L2 normalization and use_idf=False
vectorizer = TfidfVectorizer(norm='l2', use_idf=False)
# Fit and transform the text data
X = vectorizer.fit_transform(df['reviews'])
# Convert to DataFrame
S4 = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
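
To see what the two norms actually do, here is a small sketch (toy sentences, not the course dataset): with norm='l1' each row of the TF matrix sums to 1, while with norm='l2' each row has unit Euclidean length:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ["good movie", "bad movie", "good good plot"]  # hypothetical examples

tf_l1 = TfidfVectorizer(norm='l1', use_idf=False).fit_transform(toy_docs).toarray()
tf_l2 = TfidfVectorizer(norm='l2', use_idf=False).fit_transform(toy_docs).toarray()

print(tf_l1.sum(axis=1))              # each row sums to 1.0
print(np.linalg.norm(tf_l2, axis=1))  # each row has L2 norm 1.0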



Task 5: Find the TF*IDF (TFIDF) value for the given dataset and store it in the variable S5. Note: the output must be a DataFrame and its column names should be the feature words (get_feature_names).


# Default TfidfVectorizer computes TF*IDF with L2 normalization and use_idf=True
vectorizer = TfidfVectorizer()
# Fit and transform the text data
X = vectorizer.fit_transform(df['reviews'])
# Convert to DataFrame
S5 = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())



Task 6: Find the Inverse Document Frequency (IDF) value with smooth_idf set to False for the given dataset and store it in the variable S6.


Note: the output must be a DataFrame, its index should be the feature words (get_feature_names), and its column name should be 'values'.

# TfidfVectorizer with smooth_idf=False: idf(t) = ln(n / df(t)) + 1
vectorizer = TfidfVectorizer(smooth_idf=False)
# Fit and transform the text data
X = vectorizer.fit_transform(df['reviews'])
# Get the IDF values (learned during fitting)
idf_values = vectorizer.idf_
# Convert to DataFrame
S6 = pd.DataFrame(idf_values, index=vectorizer.get_feature_names(), columns=['values'])
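
For reference, with smooth_idf=False scikit-learn computes idf(t) = ln(n / df(t)) + 1, where n is the number of documents and df(t) is how many of them contain term t. A tiny sketch with made-up documents confirms this:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ["apple banana", "apple cherry", "apple banana cherry"]  # hypothetical
vec = TfidfVectorizer(smooth_idf=False)
vec.fit(toy_docs)

# 'apple' appears in all 3 docs:           ln(3/3) + 1 = 1.0
# 'banana' and 'cherry' appear in 2 docs:  ln(3/2) + 1 ≈ 1.405
print(dict(zip(vec.get_feature_names_out(), np.round(vec.idf_, 3))))
# (use get_feature_names() on older scikit-learn versions)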



LAB 3: Welcome to MLT - Sprint 5 - Case Study 3 - NLP Sentiment Analysis

Case Study 3 - Sentiment Analysis


Instruction! The dataset required for this task is given in the file named 'SA_dataset.csv'.
Read the question, then work out the solution and assign the answer to the respective variables given in the cells below.

Import the required packages and download the NLTK resources:

import pandas as pd
import nltk
import re
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('stopwords')
nltk.download('vader_lexicon')
 
from nltk.corpus import stopwords



🧠 Use Case: Sentiment Analysis on Text Data with CSV Output

In this hands-on Natural Language Processing (NLP) task, we’ll walk through how to perform sentiment analysis on a dataset and export the final result as a CSV file. The goal is to classify each text as either Positive or Negative, and store it with a corresponding numeric value in a new file named sentiment.csv.


🎯 Objective:

Build an end-to-end text classification pipeline that:

  • Reads the dataset from a CSV file.
  • Cleans and preprocesses the textual content.
  • Removes unnecessary noise like numbers and special characters.
  • Filters out common stop words.
  • Predicts sentiment (Positive or Negative) using NLTK's rule-based VADER analyzer.
  • Stores the sentiment label and its corresponding numeric value (1 for Positive, 0 for Negative) in a CSV file.


#write your code below
 
df = pd.read_csv('SA_dataset.csv')

# Convert text to lowercase and remove numbers, special characters
df['content'] = df['content'].str.lower()
df['content'] = df['content'].apply(lambda x: re.sub(r'[^a-z\s]', '', x))

# Remove stopwords
stop_words = set(stopwords.words('english'))
df['content'] = df['content'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))


# Perform Sentiment Analysis using VADER
sia = SentimentIntensityAnalyzer()
def analyze_sentiment(text):
    score = sia.polarity_scores(text)['compound']
    return ('Positive', 1) if score >= 0 else ('Negative', 0)
   
   
# Apply sentiment analysis
df[['sentiment', 'value']] = df['content'].apply(lambda x: pd.Series(analyze_sentiment(x)))


# Save results to sentiment.csv
df.to_csv('sentiment.csv', index=False)

# Display the first few rows
print(df.head())
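
For intuition, VADER's polarity_scores returns negative, neutral, positive, and compound scores; the compound score lies in [-1, 1] and is what the threshold above is applied to. A minimal sketch on made-up sentences (the scores in the comments are approximate):

from nltk.sentiment import SentimentIntensityAnalyzer
import nltk
nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("This product is great, I love it!"))
# {'neg': 0.0, 'neu': ..., 'pos': ..., 'compound': ~0.8}  -> compound >= 0, so Positive
print(sia.polarity_scores("Terrible experience, total waste of money."))
# compound < 0 -> Negative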
 



