Semantic Analyses of Expository Essays

Julian Park (University of California, Berkeley)

Overview

  1. Setup & Pre-Processing
  2. Hierarchical Clustering
  3. DBSCAN Clustering
    1. Tf-idf Vectorizer
    2. Word Embeddings
  4. Results
  5. Discussion

The question this project hopes to answer is: which semantic features in standardized exam essays most heavily affect their scores? Related sub-questions include 'How representative is such a feature of the whole essay's conceptual strength?' and 'Does this feature impartially reflect the criteria championed by standardized exams today?'. This is an important question for the humanities, specifically for linguistics and education, and researchers in both academia and industry have worked on it, as in this paper and this paper. However, such papers primarily use lexical and grammatical characteristics of the essay to linearly regress the score. This has generated significant criticism and backlash from organizations opposed to automated essay scoring in high-stakes assessments, who argue that the algorithms are reductive, inaccurate, and unfair. This project therefore attempts to find methods and features that can score essays computationally at large scale but also at a deeper semantic level.

We will first examine various clustering methods in depth on Question 3 of the most recent (2016) AP English Language exam, in which the prompt asks the student to argue a position on the importance of disobedience (i.e. rebellious behavior) in human history. Then we will apply the methods to all the collected essays in the corpus and evaluate the correlation between their actual scores and cluster values.

1. Setup & Pre-Processing

In [1]:
# Import all required tools
import re
import os
import sys
import string
import codecs
import numpy as np
import matplotlib.pyplot as plt

from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.metrics.pairwise import cosine_similarity
from scipy.cluster.hierarchy import ward, dendrogram
from scipy.spatial.distance import cosine
from scipy.stats import pearsonr

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
wl = WordNetLemmatizer()

The collected corpus for this project consists of 54 essays from prominent standardized exams (SAT, ACT, AP English Language from 2006 to 2016), each associated with its score and prompt. The essays sample a wide distribution of low and high scores (summary statistics of the corpus are printed below).

Most of these essays were available online only as handwritten PDFs, so each essay was voice-transcribed into an individual txt file and then tokenized into paragraphs, sentences, and words, with stopwords removed. Different stemmers and lemmatizers were tried, but NLTK's WordNetLemmatizer worked best for this corpus and purpose. This corpus was the most appropriate set of essays available online because it is representative of the standardized exam essays written by U.S. high school students that this project's question asks about.
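
As a minimal illustration of this pre-processing step (a made-up sample sentence, not one from the corpus; the NLTK punkt, stopwords, and wordnet data packages are assumed to be downloaded):

In [ ]:
# Minimal pre-processing sketch: tokenize, drop stopwords and punctuation, lemmatize.
sample = "The colonists were rebelling against unjust taxes."
tokens = nltk.word_tokenize(sample.lower())
print([wl.lemmatize(w) for w in tokens
       if w not in stopwords.words('english') and w not in string.punctuation])
# expected: ['colonist', 'rebelling', 'unjust', 'tax']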

In [2]:
def parag_tokenize(essay):
    """Return the essay tokenized into paragraphs."""
    parag_list = []
    temp_parag = ''
    for i in range(len(essay)):
        if essay[i] != '\n':
            temp_parag += essay[i]
        else:
            parag_list.append(temp_parag)
            temp_parag = ''
    return [p for p in parag_list if p not in ['', '\n']]

# Summary statistics of corpus 
scores, words, sents, parags = [], [], [], []
for filename in os.listdir('essays'):
    if filename[-5:] != 'p.txt': # Ignore prompt files
        essay = open("essays/"+filename).read()
        scores.append(int(essay[:3]))
        essay = essay[3:]
        words.append(len(nltk.word_tokenize(essay.lower())))
        sents.append(len(nltk.sent_tokenize(essay.lower())))
        parags.append(len(parag_tokenize(essay.lower())))
        
print("Number of essays =", len(scores))
print("Average score (normalized out of 10) =", np.mean(scores, axis=0))
print("Standard deviation of scores =", np.std(scores))
print("Average number of words =", np.mean(words, axis=0))
print("Average number of sentences =", np.mean(sents, axis=0))
print("Average number of paragraphs =", np.mean(parags, axis=0))
Number of essays = 54
Average score (normalized out of 10) = 6.01851851852
Standard deviation of scores = 2.65616502123
Average number of words = 458.0
Average number of sentences = 20.4074074074
Average number of paragraphs = 4.14814814815
In [3]:
def main_concept(prompt_tokens, essay_tokens):
    """Return intersection between prompt and essay tokens
    (to retrieve main concept and check whether essay is 
    answering the prompt or not)."""
    essay_filt = [wl.lemmatize(w) for w in essay_tokens 
                  if w not in stopwords.words('english') and w not in string.punctuation]
    essay_freq = nltk.FreqDist(essay_filt)
    essay_common = essay_freq.most_common(20)

    prompt_filt = [wl.lemmatize(w) for w in prompt_tokens 
                   if w not in stopwords.words('english') and w not in string.punctuation]
    prompt_freq = nltk.FreqDist(prompt_filt)
    prompt_common = prompt_freq.most_common(20)

    p = set([w[0] for w in prompt_common])
    e = set([w[0] for w in essay_common])
    i = list(p.intersection(e))
    return i

In order to define the main concept of each essay, the main_concept function above outputs the word(s) in the intersection of the prompt's and the essay's most common words (by frequency distribution). This main concept matters later on: for example, it is removed before clustering, because otherwise most clusters would contain these main-concept terms, which mostly adds noise to the algorithms. We now focus on the 2016 exam:

In [4]:
# Load sample txt files for this in-depth analysis
prompt = open("essays/aplang2016p.txt").read()
essay1 = open("essays/aplang2016h.txt").read()[3:] # Remove score at top of file
essay2 = open("essays/aplang2016m.txt").read()[3:]
essay3 = open("essays/aplang2016l.txt").read()[3:]

prompt_tokens = nltk.word_tokenize(prompt.lower()) # Word level
essay1_tokens = nltk.word_tokenize(essay1.lower()) # Word level
essay2_tokens = nltk.word_tokenize(essay2.lower()) # Word level
essay3_tokens = nltk.word_tokenize(essay3.lower()) # Word level

m1 = main_concept(prompt_tokens, essay1_tokens)
m2 = main_concept(prompt_tokens, essay2_tokens)
m3 = main_concept(prompt_tokens, essay3_tokens)

print(m1)
print(m2)
print(m3)
['disobedience']
['disobedience', 'wilde']
['human', 'use', 'disobedience', 'progress']
In [5]:
# Filter and pre-process each essay to get a processed list of sentences
essay1_filt = [wl.lemmatize(w) for w in essay1_tokens 
               if w not in stopwords.words('english') and w not in m1] # Lemmatize and remove main_concept terms
essay1_sents = nltk.sent_tokenize(' '.join(essay1_filt)) # Sentence level
text_list1 = [re.sub(r"[,.!?;”“’-]", ' ', sentence) for sentence in essay1_sents] # Cleaned list of sentences
names1 = range(len(text_list1)) # Index number of each sentence

essay2_filt = [wl.lemmatize(w) for w in essay2_tokens 
               if w not in stopwords.words('english') and w not in m2] # Lemmatize and remove main_concept terms
essay2_sents = nltk.sent_tokenize(' '.join(essay2_filt)) # Sentence level
text_list2 = [re.sub(r"[,.!?;”“’-]", ' ', sentence) for sentence in essay2_sents] # Cleaned list of sentences
names2 = range(len(text_list2)) # Index number of each sentence

essay3_filt = [wl.lemmatize(w) for w in essay3_tokens 
               if w not in stopwords.words('english') and w not in m3] # Lemmatize and remove main_concept terms
essay3_sents = nltk.sent_tokenize(' '.join(essay3_filt)) # Sentence level
text_list3 = [re.sub(r"[,.!?;”“’-]", ' ', sentence) for sentence in essay3_sents] # Cleaned list of sentences
names3 = range(len(text_list3)) # Index number of each sentence

print(text_list1)
print()
print(text_list2)
print()
print(text_list3)
['newton s first law state object rest stay rest object motion stay motion unless acted upon force  ', 'last part crucial   applying force motion object change  ', 'similar vein   rebellion social progress made  ', 'earliest example american stamp act congress  ', 'american people furious british enacting stamp act  ', 'first direct tax colony period salutary neglect  ', 'tax big deal   rather act taxing got everyone mad  ', 'declared  no taxation without representation  got stamp act repealed   british issue declaratory act response  ', 'however   british decided repeal stamp act  ', 'otherwise   would set precedent british directly tax colony like mainland british people  ', 'eventually   colonist fed colonist another   american revolution surprising victory   america became free britain able start self governing  ', 'people rebelled   would many unhappy british colonist right denied  ', 'employed tactic   able achieve exactly wanted  ', 'another example american history first civil right  ', 'includes notable montgomery bus strike freedom rider diner sit in  ', 'example civil  ', 'montgomery bus strike   black people montgomery   alabama right bus promised equal sitting right  ', 'freedom rider   assisted diner sit in  ', 'would sit diner wait served  ', 'many would served all white diner   thus take space restaurant goers  ', 'nonviolent method   come long way  ', 'segregated seating bus   restaurant  ', 'although seem like small step   effort part larger effort get america realize segregation wrong  ', '  longer live segregated society  ', 'progressed greatly term recognizing equality various people  ', 'waited   nothing would happened  ', 'uncomfortable racism country would dealt   instead treated fact life  ', 'recent example civil include yellow umbrella rebellion hong kong black life matter campaign u  ', 'yellow umbrella rebellion protest china selecting candidate office hong kong  ', 'agreement british handover hong kong china   agreed  one country two policies  system  ', 'allowed hong kong keep democratic nature certain point   would fully acclimated china  ', '  china slowly eroded hong kong s right   recent fixing election china friendly politician running  ', 'yellow umbrella rebellion helped bring international attention issue hong kong face  ', 'done something   hong kong would slowly suffer suffocation china  ', 'similar america fighting right britain   ultimately winning  ', 'black life matter campaign   american raising awareness racism still country  ', 'much like civil right movement earlier   black life matter us mass medium thrust people s face uncomfortable ugly truth racism black people  ', 'includes using protest   social medium    ', 'example social progress  ', 'group feel like right denied  ', 'quiet way attempt reclaim right  ', 'process long daunting   catalyzed act  ', 'although still ongoing act   already started dialogue right thing  ', 'event separating way thing way thing rebellion  ', 'people realize problem   problem must brought face  ', 'problem identified   way change thing actively change via rebellion legislation  ', 'key social change  ']

['according oscar   man s greatest virtue  ', '  claim   progress born  ', 'looking back history   clearly truth wilde s assertion  ', 'society   promotes questioning societal norm   historical change   progressive thought  ', 'person saw intrinsic value  ', 'henry david thoreau   prominent transcendentalist writer   crafted entire essay life plan principle civil  ', 'thoreau quickly developed idea regarding society disdain conformity  ', 'thought soon led action   spent time prison paying tax  ', 'form civil way challenging societal norm prompting others  ', 'transcendentalist thought became wildly popular   read thoreau s emerson s work school   left lasting imprint modern philosophy  ', 'hint thoreau s civil seen throughout 20th century  ', 'idea led pivotal historical change   however   civil right movement  ', 'leading figure martin luther king jr  rosa park urged african american spark social political change act civil  ', 'staged sit in   march   peaceful protest   hope obtaining civil right deserved  ', 'success evident many social legislative change followed civil right movement   one notable civil right act 1964  learned 5th grade field trip civil right museum   purchased t shirt quote :  well behaved woman seldom make history     idea driver historical change clear  ', 'historical change made act   progressive thought also enhanced  ', 'right   bernie sander hanging dear life 2016 presidential election  ', 'although called socialist many   radical philosophy continued garner support countless youth adult alike  ', 'point   though   likely win hillary democratic primary  ', 'despite setback   bernie s political thought asked new question future hold american politics  ', 'view decidedly anti establishment   harkens back non conformist idea transcendentalist  ', 'bernie even supporter martin luther king jr   evidence photograph one rally  ', 'even bernie win democratic nomination   strength progressive idea influenced great change shaping future american society  ', 'oscar wilde s idea drive progress clearly accurate one  ', 'led people questioning society   causing historical change   thinking progressive idea  ', '   we music maker   dreamer dreams  ']

['oscar wilde observed    disobedience   eye anyone read history   man s original virtue  ', 'made   rebellion   history shown different form rebellion resulted basic right today  ', 'it s also seen modern day people currently making history breaking norm society  ', 'without   society would never whole  ', 'satirical  animal farm  george orwell exemplifies rebellion overthrow corrupt communist leader  ', 'novel   character first accept reality leader  ', 'slowly realize unfair treatment put rebelled leader   eventually overthrowing  ', 'novel also analogy soviet union communism joseph stalin  ', 'experience character novel people soviet union caused disobey leader  ', 'able conscious rejecting lie leader feeding  ', 'one important cause rebellion ability think clearly situation  ', 'without   critical thinking process would  ', 'instead   change happen  ', 'don t sit around wait get want   would waste ability  ', 'rebellion come different form promote social  ', 'wouldn t people today without ancestor  ', 'knowledge obtained resulted change struggling experience many people past  ']

2. Hierarchical Clustering

Expository essays assert an argument justified by evidence and reasoning; the best ones are coherent and convincing, with well-developed explanations. To determine the strength of an essay, we need some well-operationalized, quantitative measure extracted from it. If the student used multiple examples to support their claim, how does one semantically measure how well developed each piece of supporting evidence is (while avoiding raw paragraph length, which students could easily game)?

A natural technique for this task is to cluster the evidence into its atomic ideas and count how many ideas pass some relevancy threshold within the paragraph. The problem with using K-Means to cluster the student's examples, however, is that it requires the number of desired clusters K a priori. Other methods, such as hierarchical clustering (first algorithm below) and DBSCAN (next algorithm), allow clustering without a predetermined number of clusters.
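
To make the contrast concrete, here is a toy sketch with made-up 2-D points (not essay data; AgglomerativeClustering with a distance_threshold assumes a reasonably recent scikit-learn):

In [ ]:
# Toy contrast: K-Means needs the cluster count K up front, while hierarchical
# clustering can cut at a distance threshold so the number of clusters emerges.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

pts = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
print(KMeans(n_clusters=2).fit(pts).labels_)  # K chosen a priori
print(AgglomerativeClustering(n_clusters=None, distance_threshold=5.0).fit(pts).labels_)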

Let's analyze with dendrograms the hierarchical clusters of essay1 (high score), essay2 (mediocre score), and essay3 (low score) in depth to gain a sense of what clustering looks like visually. For each essay, we take the cleaned sentences, note which paragraph each belongs to, build a document-term matrix, and then use Ward's method on cosine distances to display the hierarchical clusters (i.e. atomic ideas).

essay1:

In [6]:
# Build document term matrix
vectorizer = CountVectorizer(stop_words='english') 
dtm1 = vectorizer.fit_transform(text_list1) 
vocab1 = vectorizer.get_feature_names()
dtm1 = dtm1.toarray()

# Bucket the essays' sentences into respective paragraphs
essay1p = parag_tokenize(essay1.lower())
essay1p_sents = [nltk.sent_tokenize(p) for p in essay1p]
essay1p_indices, i = [], 0
for p in essay1p_sents:
    essay1p_indices.append(i)
    i += len(p)
print(str(len(essay1p)) + " PARAGRAPHS, STARTING AT INDICES ", essay1p_indices)

# Plot dendrogram 
cos_dist1 = 1 - cosine_similarity(dtm1)
linkage_matrix1 = ward(cos_dist1)
dendrogram(linkage_matrix1, color_threshold=0.6*max(linkage_matrix1[:,2]), orientation="right", labels=names1)
plt.tight_layout()
plt.show()
5 PARAGRAPHS, STARTING AT INDICES  [0, 3, 13, 27, 43]

essay2:

In [7]:
# Build document term matrix
vectorizer = CountVectorizer(stop_words='english') 
dtm2 = vectorizer.fit_transform(text_list2) 
vocab2 = vectorizer.get_feature_names()
dtm2 = dtm2.toarray()

# Bucket the essays' sentences into respective paragraphs
essay2p = parag_tokenize(essay2.lower())
essay2p_sents = [nltk.sent_tokenize(p) for p in essay2p]
essay2p_indices, i = [], 0
for p in essay2p_sents:
    essay2p_indices.append(i)
    i += len(p)
print(str(len(essay2p)) + " PARAGRAPHS, STARTING AT INDICES ", essay2p_indices)

# Plot dendrogram 
cos_dist2 = 1 - cosine_similarity(dtm2)
linkage_matrix2 = ward(cos_dist2)
dendrogram(linkage_matrix2, color_threshold=0.6*max(linkage_matrix2[:,2]), orientation="right", labels=names2)
plt.tight_layout()
plt.show()
5 PARAGRAPHS, STARTING AT INDICES  [0, 4, 10, 15, 23]

essay3:

In [8]:
# Build document term matrix
vectorizer = CountVectorizer(stop_words='english') 
dtm3 = vectorizer.fit_transform(text_list3) 
vocab3 = vectorizer.get_feature_names()
dtm3 = dtm3.toarray()

# Bucket the essays' sentences into respective paragraphs
essay3p = parag_tokenize(essay3.lower())
essay3p_sents = [nltk.sent_tokenize(p) for p in essay3p]
essay3p_indices, i = [], 0
for p in essay3p_sents:
    essay3p_indices.append(i)
    i += len(p)
print(str(len(essay3p)) + " PARAGRAPHS, STARTING AT INDICES ", essay3p_indices)

# Plot dendrogram 
cos_dist3 = 1 - cosine_similarity(dtm3)
linkage_matrix3 = ward(cos_dist3)
dendrogram(linkage_matrix3, color_threshold=0.6*max(linkage_matrix3[:,2]), orientation="right", labels=names3)
plt.tight_layout()
plt.show()
4 PARAGRAPHS, STARTING AT INDICES  [0, 4, 10, 14]

From visual inspection, we see that essay1 has 6 nontrivial clusters, essay2 has 5, and essay3 has 1, where a nontrivial cluster here is defined to have at least two same-colored sentences grouped together. The higher the score, the more clusters the essay seems to have.
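
The same kind of count can be read off the linkage matrices programmatically rather than by eye. A sketch, assuming the linkage_matrix1/2/3 variables from the cells above and the same 0.6 cut used for the dendrogram coloring (the flat clusters correspond only roughly to the colored subtrees):

In [ ]:
# Sketch: count flat clusters containing at least two sentences at the coloring cut.
from collections import Counter
from scipy.cluster.hierarchy import fcluster

for lm in (linkage_matrix1, linkage_matrix2, linkage_matrix3):
    flat = fcluster(lm, t=0.6 * max(lm[:, 2]), criterion='distance')
    sizes = Counter(flat)
    print(sum(1 for size in sizes.values() if size >= 2), "nontrivial clusters")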

3. DBSCAN Clustering

The DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm clusters data points in regions of high density and disregards outliers in regions of low density. It is popular in the scientific literature for its flexibility with varying cluster sizes and shapes as well as its robustness to outliers. As mentioned above, its main advantage for our task is that it discovers however many nontrivial sentence clusters pass the hyperparameter thresholds (epsilon and min_samples for DBSCAN, min_df and max_df for the vectorizer) rather than requiring the number of clusters up front. This may give significant insight into how the essay's arguments are semantically structured.
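
As a toy illustration of how eps and min_samples shape the output (made-up 1-D points, not the essay data):

In [ ]:
# Toy DBSCAN run: dense groups become clusters, isolated points get the noise label -1.
import numpy as np
from sklearn.cluster import DBSCAN

toy = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2], [9.0]])
print(DBSCAN(eps=0.3, min_samples=3).fit(toy).labels_)
# expected: [ 0  0  0  1  1  1 -1]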

Let's now analyze density-based clusters in essay1 (high score), essay2 (mediocre score), and essay3 (low score). We first use a Tf-idf vectorizer to build a matrix based on the frequencies of the words within the essay, where each "document" is a sentence. Next, we use word embeddings, averaging the word vectors in each sentence to build a matrix of sentence vectors. Both matrices are then fed to DBSCAN to identify dense regions, and the resulting clusters are plotted (though the 2D plots cannot fully capture the positions of the high-dimensional vectors).

3A. Tf-idf Vectorizer

In [9]:
# Build DBSCAN model with Tf-idf vectorizer
tfidfvec = TfidfVectorizer(ngram_range=(1,5), min_df = 0.1, max_df = 1.0, decode_error = "ignore")

essay1:

In [10]:
# Run DBSCAN with essay1
X1 = tfidfvec.fit_transform(text_list1).toarray()
db1 = DBSCAN(eps=0.9, min_samples=3).fit(X1)
core_samples_mask = np.zeros_like(db1.labels_, dtype=bool)
core_samples_mask[db1.core_sample_indices_] = True
labels1 = db1.labels_
n_clusters_ = len(set(labels1)) - (1 if -1 in labels1 else 0) # Number of clusters in labels
print(labels1)
print()

clusters1 = {}
for c, i in enumerate(labels1):
    if i == -1:
        continue
    elif i in clusters1:
        clusters1[i].append( text_list1[c] )
    else:
        clusters1[i] = [text_list1[c]]
for c in clusters1:
    print(clusters1[c])
    print()
[0 0 1 2 3 0 3 3 3 3 2 3 0 2 0 2 3 0 4 4 0 0 0 0 3 4 4 5 5 5 5 5 5 5 3 2 3
 0 2 3 3 3 3 1 3 1 0]

['newton s first law state object rest stay rest object motion stay motion unless acted upon force  ', 'last part crucial   applying force motion object change  ', 'first direct tax colony period salutary neglect  ', 'employed tactic   able achieve exactly wanted  ', 'includes notable montgomery bus strike freedom rider diner sit in  ', 'freedom rider   assisted diner sit in  ', 'nonviolent method   come long way  ', 'segregated seating bus   restaurant  ', 'although seem like small step   effort part larger effort get america realize segregation wrong  ', '  longer live segregated society  ', 'includes using protest   social medium    ', 'key social change  ']

['similar vein   rebellion social progress made  ', 'event separating way thing way thing rebellion  ', 'problem identified   way change thing actively change via rebellion legislation  ']

['earliest example american stamp act congress  ', 'eventually   colonist fed colonist another   american revolution surprising victory   america became free britain able start self governing  ', 'another example american history first civil right  ', 'example civil  ', 'black life matter campaign   american raising awareness racism still country  ', 'example social progress  ']

['american people furious british enacting stamp act  ', 'tax big deal   rather act taxing got everyone mad  ', 'declared  no taxation without representation  got stamp act repealed   british issue declaratory act response  ', 'however   british decided repeal stamp act  ', 'otherwise   would set precedent british directly tax colony like mainland british people  ', 'people rebelled   would many unhappy british colonist right denied  ', 'montgomery bus strike   black people montgomery   alabama right bus promised equal sitting right  ', 'progressed greatly term recognizing equality various people  ', 'similar america fighting right britain   ultimately winning  ', 'much like civil right movement earlier   black life matter us mass medium thrust people s face uncomfortable ugly truth racism black people  ', 'group feel like right denied  ', 'quiet way attempt reclaim right  ', 'process long daunting   catalyzed act  ', 'although still ongoing act   already started dialogue right thing  ', 'people realize problem   problem must brought face  ']

['would sit diner wait served  ', 'many would served all white diner   thus take space restaurant goers  ', 'waited   nothing would happened  ', 'uncomfortable racism country would dealt   instead treated fact life  ']

['recent example civil include yellow umbrella rebellion hong kong black life matter campaign u  ', 'yellow umbrella rebellion protest china selecting candidate office hong kong  ', 'agreement british handover hong kong china   agreed  one country two policies  system  ', 'allowed hong kong keep democratic nature certain point   would fully acclimated china  ', '  china slowly eroded hong kong s right   recent fixing election china friendly politician running  ', 'yellow umbrella rebellion helped bring international attention issue hong kong face  ', 'done something   hong kong would slowly suffer suffocation china  ']

In [11]:
# Plotting Tool Source: http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html
unique_labels = set(labels1)
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
for k, col in zip(unique_labels, colors):
    class_member_mask = (labels1 == k)
    xy = X1[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col), markeredgecolor='k', markersize=14)
    xy = X1[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col), markeredgecolor='k', markersize=6)
    
plt.axis([-0.5, 1.5, -0.5, 1.5])
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

essay2:

In [12]:
# Run DBSCAN with essay2
X2 = tfidfvec.fit_transform(text_list2).toarray()
db2 = DBSCAN(eps=0.9, min_samples=3).fit(X2)
core_samples_mask = np.zeros_like(db2.labels_, dtype=bool)
core_samples_mask[db2.core_sample_indices_] = True
labels2 = db2.labels_
n_clusters_ = len(set(labels2)) - (1 if -1 in labels2 else 0) # Number of clusters in labels
print(labels2)
print()

clusters2 = {}
for c, i in enumerate(labels2):
    if i == -1:
        continue
    elif i in clusters2:
        clusters2[i].append( text_list2[c] )
    else:
        clusters2[i] = [text_list2[c]]
for c in clusters2:
    print(clusters2[c])
    print()
[ 0  0  0  1  0  2 -1 -1  2  2  2  1 -1  1  1  1 -1  0  0 -1 -1 -1 -1 -1  1
  0]

['according oscar   man s greatest virtue  ', '  claim   progress born  ', 'looking back history   clearly truth wilde s assertion  ', 'person saw intrinsic value  ', 'although called socialist many   radical philosophy continued garner support countless youth adult alike  ', 'point   though   likely win hillary democratic primary  ', '   we music maker   dreamer dreams  ']

['society   promotes questioning societal norm   historical change   progressive thought  ', 'idea led pivotal historical change   however   civil right movement  ', 'staged sit in   march   peaceful protest   hope obtaining civil right deserved  ', 'success evident many social legislative change followed civil right movement   one notable civil right act 1964  learned 5th grade field trip civil right museum   purchased t shirt quote :  well behaved woman seldom make history     idea driver historical change clear  ', 'historical change made act   progressive thought also enhanced  ', 'led people questioning society   causing historical change   thinking progressive idea  ']

['henry david thoreau   prominent transcendentalist writer   crafted entire essay life plan principle civil  ', 'form civil way challenging societal norm prompting others  ', 'transcendentalist thought became wildly popular   read thoreau s emerson s work school   left lasting imprint modern philosophy  ', 'hint thoreau s civil seen throughout 20th century  ']

In [13]:
unique_labels = set(labels2)
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
for k, col in zip(unique_labels, colors):
    class_member_mask = (labels2 == k)
    xy = X2[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col), markeredgecolor='k', markersize=14)
    xy = X2[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col), markeredgecolor='k', markersize=6)
    
plt.axis([-0.5, 1.5, -0.5, 1.5])
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

essay3:

In [14]:
# Run DBSCAN with essay3
X3 = tfidfvec.fit_transform(text_list3).toarray()
db3 = DBSCAN(eps=0.9, min_samples=3).fit(X3)
core_samples_mask = np.zeros_like(db3.labels_, dtype=bool)
core_samples_mask[db3.core_sample_indices_] = True
labels3 = db3.labels_
n_clusters_ = len(set(labels3)) - (1 if -1 in labels3 else 0) # Number of clusters in labels
print(labels3)
print()
    
clusters3 = {}
for c, i in enumerate(labels3):
    if i == -1:
        continue
    elif i in clusters3:
        clusters3[i].append( text_list3[c] )
    else:
        clusters3[i] = [text_list3[c]]
for c in clusters3:
    print(clusters3[c])
    print()
[-1 -1 -1 -1  0 -1  0 -1 -1  0 -1 -1 -1 -1 -1 -1 -1]

['satirical  animal farm  george orwell exemplifies rebellion overthrow corrupt communist leader  ', 'slowly realize unfair treatment put rebelled leader   eventually overthrowing  ', 'able conscious rejecting lie leader feeding  ']

In [15]:
unique_labels = set(labels3)
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
for k, col in zip(unique_labels, colors):
    class_member_mask = (labels3 == k)
    xy = X3[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col), markeredgecolor='k', markersize=14)
    xy = X3[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col), markeredgecolor='k', markersize=6)
    
plt.axis([-0.5, 1.5, -0.5, 1.5])
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

We see that essay1 has 6 nontrivial clusters, essay2 has 3, and essay3 has 1, where a nontrivial cluster here must contain at least three sentences (min_samples), each reachable from the others through neighbors within the epsilon radius. The higher the score, the more clusters the essay seems to have. The clustered topics look fairly accurate for essay1.
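
The sizes behind these counts can also be tabulated directly from the label arrays (a small sketch; labels1/2/3 are the tf-idf DBSCAN labels computed above, and -1 is DBSCAN's noise label):

In [ ]:
# Sketch: tabulate tf-idf DBSCAN cluster sizes per essay, excluding noise (-1).
from collections import Counter

for lbls in (labels1, labels2, labels3):
    print(dict(Counter(l for l in lbls if l != -1)))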

3B. Word Embeddings

Currently, word2vec (Google, 2013) and GloVe (Stanford, 2014) are the most popular ways to extract vector embeddings for words. Word2vec is a predictive model and GloVe a count-based model, but both provide decent word vectors for many NLP analyses. In this project, we use GloVe (Global Vectors for Word Representation), an unsupervised learning algorithm whose pre-trained word vectors are derived from word-word co-occurrence statistics.

For each essay, we map each word in a sentence to its word vector (from the GloVe mapping) and average those vectors to get a sentence vector. Those sentence vectors are then clustered with DBSCAN using a cosine distance metric. This clustering approach does not produce good results for the 2016 exam on its own (essay1, essay2, and essay3 yield 2, 1, and 2 clusters respectively, with no clear relationship to score), but further analysis on the rest of the corpus shows it is a noteworthy measure of an essay's strength.
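
Before running this on the essays, here is a minimal, self-contained sketch of the idea using a made-up four-word embedding dictionary (the real GloVe vectors are loaded in the next cell):

In [ ]:
# Minimal sketch (toy embeddings, not GloVe): a sentence vector is the mean of its
# word vectors, and DBSCAN with a cosine metric then groups similar sentences.
import numpy as np
from sklearn.cluster import DBSCAN

toy_vecs = {'tax': np.array([1.0, 0.0]), 'colony': np.array([0.9, 0.1]),
            'bus': np.array([0.0, 1.0]), 'diner': np.array([0.1, 0.9])}
toy_sents = [['tax', 'colony'], ['colony', 'tax'], ['bus', 'diner'], ['diner', 'bus']]
X_toy = [np.mean([toy_vecs[w] for w in s], axis=0) for s in toy_sents]
print(DBSCAN(eps=0.05, min_samples=2, metric='cosine', algorithm='brute').fit(X_toy).labels_)
# expected: the two 'tax'/'colony' sentences and the two 'bus'/'diner' sentences form separate clusters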

In [16]:
word2vec = {} # Dictionary mapping English words to their embedding vectors
with codecs.open("glove.6B/glove.6B.200d.txt", 'r', 'utf-8') as file: # 200 dimensions per vector
    for r in file:
        sr = r.split()
        word2vec[sr[0]] = np.array([float(i) for i in sr[1:]])
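
As a quick sanity check that these vectors capture semantic relatedness (a sketch; the three words are chosen arbitrarily and assumed to be in the GloVe vocabulary):

In [ ]:
# Sanity check: cosine similarity should be higher for related words than for unrelated ones.
from scipy.spatial.distance import cosine

print(1 - cosine(word2vec['disobedience'], word2vec['rebellion']))  # expected to be relatively high
print(1 - cosine(word2vec['disobedience'], word2vec['banana']))     # expected to be relatively low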
In [17]:
# Build X1 matrix by averaging each sentence's word vectors
X1 = []
wvs1 = []
for sent in text_list1:
    for word in nltk.word_tokenize(sent):
        wv = word2vec.get(word, np.zeros(200)) # Zero vector for nonexistent words
        wvs1.append(wv)
    X1.append(np.mean(wvs1, axis=0)) # Add sentence vector

db1 = DBSCAN(eps=0.0011, min_samples=3, metric='cosine', algorithm='brute').fit(X1)
core_samples_mask = np.zeros_like(db1.labels_, dtype=bool)
core_samples_mask[db1.core_sample_indices_] = True
labels1 = db1.labels_
n_clusters_ = len(set(labels1)) - (1 if -1 in labels1 else 0) # Number of clusters in labels
print(labels1)
print('Estimated number of clusters: %d' % n_clusters_)
print()

clusters1 = {}
for c, i in enumerate(labels1):
    if i in clusters1:
        clusters1[i].append( text_list1[c] )
    else:
        clusters1[i] = [text_list1[c]]
for c in clusters1:
    print(clusters1[c])
    print()
[-1 -1 -1 -1 -1 -1 -1 -1 -1 -1  0  0  0  0  0  0  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1]
Estimated number of clusters: 2

['eventually   colonist fed colonist another   american revolution surprising victory   america became free britain able start self governing  ', 'people rebelled   would many unhappy british colonist right denied  ', 'employed tactic   able achieve exactly wanted  ', 'another example american history first civil right  ', 'includes notable montgomery bus strike freedom rider diner sit in  ', 'example civil  ']

['montgomery bus strike   black people montgomery   alabama right bus promised equal sitting right  ', 'freedom rider   assisted diner sit in  ', 'would sit diner wait served  ', 'many would served all white diner   thus take space restaurant goers  ', 'nonviolent method   come long way  ', 'segregated seating bus   restaurant  ', 'although seem like small step   effort part larger effort get america realize segregation wrong  ', '  longer live segregated society  ', 'progressed greatly term recognizing equality various people  ', 'waited   nothing would happened  ', 'uncomfortable racism country would dealt   instead treated fact life  ', 'recent example civil include yellow umbrella rebellion hong kong black life matter campaign u  ', 'yellow umbrella rebellion protest china selecting candidate office hong kong  ', 'agreement british handover hong kong china   agreed  one country two policies  system  ', 'allowed hong kong keep democratic nature certain point   would fully acclimated china  ', '  china slowly eroded hong kong s right   recent fixing election china friendly politician running  ', 'yellow umbrella rebellion helped bring international attention issue hong kong face  ', 'done something   hong kong would slowly suffer suffocation china  ', 'similar america fighting right britain   ultimately winning  ', 'black life matter campaign   american raising awareness racism still country  ', 'much like civil right movement earlier   black life matter us mass medium thrust people s face uncomfortable ugly truth racism black people  ', 'includes using protest   social medium    ', 'example social progress  ', 'group feel like right denied  ', 'quiet way attempt reclaim right  ', 'process long daunting   catalyzed act  ', 'although still ongoing act   already started dialogue right thing  ', 'event separating way thing way thing rebellion  ', 'people realize problem   problem must brought face  ', 'problem identified   way change thing actively change via rebellion legislation  ', 'key social change  ']

['newton s first law state object rest stay rest object motion stay motion unless acted upon force  ', 'last part crucial   applying force motion object change  ', 'similar vein   rebellion social progress made  ', 'earliest example american stamp act congress  ', 'american people furious british enacting stamp act  ', 'first direct tax colony period salutary neglect  ', 'tax big deal   rather act taxing got everyone mad  ', 'declared  no taxation without representation  got stamp act repealed   british issue declaratory act response  ', 'however   british decided repeal stamp act  ', 'otherwise   would set precedent british directly tax colony like mainland british people  ']

In [18]:
X2 = []
wvs2 = []
for sent in text_list2:
    for word in nltk.word_tokenize(sent):
        wv = word2vec.get(word, np.zeros(200)) # Zero vector for nonexistent words
        wvs2.append(wv)
    X2.append(np.mean(wvs2, axis=0)) # Add sentence vector
    
db2 = DBSCAN(eps=0.0011, min_samples=3, metric='cosine', algorithm='brute').fit(X2)
core_samples_mask = np.zeros_like(db2.labels_, dtype=bool)
core_samples_mask[db2.core_sample_indices_] = True
labels2 = db2.labels_
n_clusters_ = len(set(labels2)) - (1 if -1 in labels2 else 0) # Number of clusters in labels
print(labels2)
print('Estimated number of clusters: %d' % n_clusters_)
print()

clusters2 = {}
for c, i in enumerate(labels2):
    if i in clusters2:
        clusters2[i].append( text_list2[c] )
    else:
        clusters2[i] = [text_list2[c]]
for c in clusters2:
    print(clusters2[c])
    print()
[-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  0  0  0  0  0  0  0  0  0  0  0
  0]
Estimated number of clusters: 1

['success evident many social legislative change followed civil right movement   one notable civil right act 1964  learned 5th grade field trip civil right museum   purchased t shirt quote :  well behaved woman seldom make history     idea driver historical change clear  ', 'historical change made act   progressive thought also enhanced  ', 'right   bernie sander hanging dear life 2016 presidential election  ', 'although called socialist many   radical philosophy continued garner support countless youth adult alike  ', 'point   though   likely win hillary democratic primary  ', 'despite setback   bernie s political thought asked new question future hold american politics  ', 'view decidedly anti establishment   harkens back non conformist idea transcendentalist  ', 'bernie even supporter martin luther king jr   evidence photograph one rally  ', 'even bernie win democratic nomination   strength progressive idea influenced great change shaping future american society  ', 'oscar wilde s idea drive progress clearly accurate one  ', 'led people questioning society   causing historical change   thinking progressive idea  ', '   we music maker   dreamer dreams  ']

['according oscar   man s greatest virtue  ', '  claim   progress born  ', 'looking back history   clearly truth wilde s assertion  ', 'society   promotes questioning societal norm   historical change   progressive thought  ', 'person saw intrinsic value  ', 'henry david thoreau   prominent transcendentalist writer   crafted entire essay life plan principle civil  ', 'thoreau quickly developed idea regarding society disdain conformity  ', 'thought soon led action   spent time prison paying tax  ', 'form civil way challenging societal norm prompting others  ', 'transcendentalist thought became wildly popular   read thoreau s emerson s work school   left lasting imprint modern philosophy  ', 'hint thoreau s civil seen throughout 20th century  ', 'idea led pivotal historical change   however   civil right movement  ', 'leading figure martin luther king jr  rosa park urged african american spark social political change act civil  ', 'staged sit in   march   peaceful protest   hope obtaining civil right deserved  ']

In [19]:
X3 = []
wvs3 = []
for sent in text_list3:
    for word in nltk.word_tokenize(sent):
        wv = word2vec.get(word, np.zeros(200)) # Zero vector for nonexistent words
        wvs3.append(wv)
    X3.append(np.mean(wvs3, axis=0)) # Add sentence vector
    
db3 = DBSCAN(eps=0.0011, min_samples=3, metric='cosine', algorithm='brute').fit(X3)
core_samples_mask = np.zeros_like(db3.labels_, dtype=bool)
core_samples_mask[db3.core_sample_indices_] = True
labels3 = db3.labels_
n_clusters_ = len(set(labels3)) - (1 if -1 in labels3 else 0) # Number of clusters in labels
print(labels3)
print('Estimated number of clusters: %d' % n_clusters_)
print()

clusters3 = {}
for c, i in enumerate(labels3):
    if i in clusters3:
        clusters3[i].append( text_list3[c] )
    else:
        clusters3[i] = [text_list3[c]]
for c in clusters3:
    print(clusters3[c])
    print()
[-1 -1 -1 -1 -1 -1 -1 -1 -1 -1  0  0  0  1  1  1  1]
Estimated number of clusters: 2

['one important cause rebellion ability think clearly situation  ', 'without   critical thinking process would  ', 'instead   change happen  ']

['don t sit around wait get want   would waste ability  ', 'rebellion come different form promote social  ', 'wouldn t people today without ancestor  ', 'knowledge obtained resulted change struggling experience many people past  ']

['oscar wilde observed    disobedience   eye anyone read history   man s original virtue  ', 'made   rebellion   history shown different form rebellion resulted basic right today  ', 'it s also seen modern day people currently making history breaking norm society  ', 'without   society would never whole  ', 'satirical  animal farm  george orwell exemplifies rebellion overthrow corrupt communist leader  ', 'novel   character first accept reality leader  ', 'slowly realize unfair treatment put rebelled leader   eventually overthrowing  ', 'novel also analogy soviet union communism joseph stalin  ', 'experience character novel people soviet union caused disobey leader  ', 'able conscious rejecting lie leader feeding  ']

4. Results

The clustering methods above are now run over the rest of the essays in the corpus to examine patterns in the resulting values. Pearson correlation coefficients quantify how correlated the essays' scores and cluster values are, where the coefficient ranges between -1 and +1 (-1 is total negative linear correlation, 0 is no linear correlation, and +1 is total positive linear correlation).
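
As a quick reminder of pearsonr's output format (a toy illustration, not corpus data):

In [ ]:
# pearsonr returns (correlation coefficient, two-sided p-value); r is roughly +0.77 here.
print(pearsonr([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]))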

In [21]:
# Run over all collected SAT, ACT, and AP English Language essays from 2006 to 2016
essay_scores = []
tfidf_clusters = []
weight_tfidf_clusters = []
glove_clusters = []
weight_glove_clusters = []
num = []
for filename in os.listdir('essays'):
    if filename[-5:] != 'p.txt': 
        prompt = open("essays/"+filename[:-5]+"p.txt").read()
        essay = open("essays/"+filename).read()
        essay_scores.append(int(essay[:3])) # Add this essay's score
        essay = essay[3:]
        prompt_tokens = nltk.word_tokenize(prompt.lower())
        essay_tokens = nltk.word_tokenize(essay.lower()) 
        m = main_concept(prompt_tokens, essay_tokens)
        num.append(len(essay_tokens))

        essay_filt = [wl.lemmatize(w) for w in essay_tokens if w not in stopwords.words('english')]
        essay_sents = nltk.sent_tokenize(' '.join(essay_filt))
        text_list = [re.sub(r"[,.!?;”“’-]", ' ', sentence) for sentence in essay_sents]


        X = tfidfvec.fit_transform(text_list).toarray()
        db = DBSCAN(eps=0.9, min_samples=2).fit(X)
        n_clusters_t = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
        tfidf_clusters.append(n_clusters_t) # Add this essay's tf-idf number of clusters

        clusters = {}
        for c, i in enumerate(db.labels_):
            if i == -1:
                continue # Ignore non-clustered sentences
            if i in clusters:
                clusters[i].append( text_list[c] )
            else:
                clusters[i] = [text_list[c]]
        avg_words = 1
        if len(clusters) > 0:
            avg_words = np.mean([sum([len(s.split()) for s in clusters[c]]) for c in clusters]) # Avg number of words in clusters
        weight_tfidf_clusters.append((n_clusters_t+1)*avg_words) # Add this essay's weighted tf-idf cluster value

        X = []
        wvs = []
        for sent in text_list:
            for word in nltk.word_tokenize(sent):
                wv = word2vec.get(word, np.zeros(200)) # Zero vector for nonexistent words
                wvs.append(wv)
            X.append(np.mean(wvs, axis=0))
        db = DBSCAN(eps= 0.0011, min_samples=3, metric='cosine', algorithm='brute').fit(X)
        n_clusters_g = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
        glove_clusters.append(n_clusters_g) # Add this essay's GloVe number of clusters

        clusters = {}
        for c, i in enumerate(db.labels_):
            if i == -1:
                continue # Ignore non-clustered sentences
            if i in clusters:
                clusters[i].append(text_list[c])
            else:
                clusters[i] = [text_list[c]]
        avg_words = 1
        if len(clusters) > 0:
            avg_words = np.mean([sum([len(s.split()) for s in clusters[c]]) for c in clusters]) # Avg number of words in clusters
        weight_glove_clusters.append((n_clusters_g+1)*avg_words) # Add this essay's GloVe weighted number of clusters

        
print("PCC with Tfidf Clusters =", pearsonr(essay_scores, tfidf_clusters)[0])
print("p-value =", pearsonr(essay_scores, tfidf_clusters)[1])
print()
print("PCC with Weighted Tfidf Clusters =", pearsonr(essay_scores, weight_tfidf_clusters)[0])
print("p-value =", pearsonr(essay_scores, weight_tfidf_clusters)[1])
print()
print("PCC with GloVe Clusters =", pearsonr(essay_scores, glove_clusters)[0])
print("p-value =", pearsonr(essay_scores, glove_clusters)[1])
print()
print("PCC with Weighted GloVe Clusters =", pearsonr(essay_scores, weight_glove_clusters)[0])
print("p-value =", pearsonr(essay_scores, weight_glove_clusters)[1])

plt.xlabel('Essay Score')
plt.ylabel('Weighted GloVe Clusters')
plt.scatter(essay_scores, weight_glove_clusters)
fit = np.polyfit(essay_scores, weight_glove_clusters, 1)
fit_fn = np.poly1d(fit) 
plt.plot(essay_scores,weight_glove_clusters, 'yo', essay_scores, fit_fn(essay_scores), '--k')
plt.show()
PCC with Tfidf Clusters = 0.333341410649
p-value = 0.0137730543202

PCC with Weighted Tfidf Clusters = 0.524381659774
p-value = 4.70120675279e-05

PCC with GloVe Clusters = 0.437472903673
p-value = 0.000939830264862

PCC with Weighted GloVe Clusters = 0.779088961779
p-value = 3.98097945166e-12

The hypothesis I made in the project proposal was that the magnitude of the semantic clusters (connoting how well developed each example is) would have a positive correlation with the essay's score. That is, an essay that simply discusses one example topic throughout would not strongly support its thesis with varied evidence, and at the other extreme, an essay that discusses many trivial clusters (i.e. clusters that do not pass the minimum relevancy threshold) would not have well-developed arguments supporting its thesis. The clusters used in the methods above were required to be nontrivial by passing the thresholds set by the algorithms' parameters, where a "nontrivial" cluster is defined to contain at least two or three sentences, depending on the method. These hyperparameters (color_threshold for hierarchical clustering, epsilon and min_samples for DBSCAN) were tweaked to yield the most variation in the number of clusters formed.

The number of clusters from the tf-idf vectorizer did not provide remarkable results, only a +0.33 correlation with the essays' actual scores, although the number of GloVe clusters gave better results with a +0.44 correlation. In order to reward semantic clusters showing more conceptual development, I then weighted the cluster count by cluster size, multiplying (number of clusters + 1) by the average number of words contained in a cluster. This gave a significant boost to the correlation coefficients for both DBSCAN methods, ultimately supporting my hypothesis: a +0.78 correlation with a p-value near zero rejects the null hypothesis, so I can say with very high confidence that there is a positive correlation between the essay scores and their weighted GloVe cluster values. This relationship is plotted above. It makes sense: higher-scoring essays should contain more well-developed (nontrivial) ideas.
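
For clarity, the weighting computed inside the Results loop above can be restated as a standalone helper (a sketch; labels is a DBSCAN label array and text_list the corresponding cleaned sentence list, as in that loop):

In [ ]:
# Sketch restating the weighted cluster value used above: (number of clusters + 1)
# times the average number of words contained in a cluster. Noise (label -1) is
# excluded, and essays with no clusters fall back to an average of 1.
import numpy as np

def weighted_cluster_value(labels, text_list):
    clusters = {}
    for idx, lab in enumerate(labels):
        if lab != -1:
            clusters.setdefault(lab, []).append(text_list[idx])
    avg_words = 1
    if clusters:
        avg_words = np.mean([sum(len(s.split()) for s in sents) for sents in clusters.values()])
    return (len(clusters) + 1) * avg_words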

5. Discussion

This project focused on semantic relationships between the student's ideas and explanations, as opposed to the more lexical and grammatical features used by past essay scoring algorithms. The results of this vector space analysis, from both clustering algorithms, do point in that semantic direction. The high correlation between the cluster values and the essay scores supports my hypothesis that the magnitude of an essay's semantic clusters has a positive relationship with its score.

Every year, the College Board's grading report for the AP English Language exam announces that, in high-scoring essays, "the evidence and explanations used are appropriate and convincing, and the argument is especially coherent and well developed." The results of this project agree with that statement, which further underscores how important it is for students to focus on and master the skill of writing convincing, well-developed essays. If exam tutors or test prep schools are not already prioritizing this particular skill, they should.

There is, however, a fundamental limitation to this research: no external knowledge of the real world is taken into account. No matter how sophisticated the approach, an algorithm confined to the given essay and prompt cannot fully evaluate the student's work. For example, in 2016's high-scoring essay, only external knowledge from, say, online resources or Wikipedia would reveal why the Black Lives Matter movement is relevant to civil disobedience; to the computer, the movement's name is just a literal phrase without contextual meaning.

Going forward, even deeper textual analyses are needed, focusing more on logic extraction and on validating evidence clusters against external knowledge. Through this analysis, I learned that generating valuable semantic measures of conceptual strength is difficult. In some years, lower-scoring essays contain as many "nontrivial clusters" as higher-scoring essays; the difference is that the logical reasoning behind those clusters is not as strong. Investigating how to measure the strength of the logic behind a paragraph's ideas will be a fascinating task for further research.
