TEDxKinda Project: Automated Summarization
Have you seen any word clouds floating around the Internet? For example, the Wordle image to the right is a word cloud generated using the text of all the webpages included in the Big Data module.
Wordles are an attempt to automatically summarize a large amount of data visually. They rely on word counts to identify what might be important ideas; however, there are a variety of strategies one can use to automate summarization, and each strategy has advantages and disadvantages related to its utility and its validity.
Humans are always looking for shortcuts, timesavers, and ways to make their lives easier. This includes attempts to read and comprehend long texts. Perhaps you have used Cliff’s Notes instead of reading a novel for an English literature class. Still, some person (maybe Cliff?) had to read the original book and create the summary.
Automated summarization, in which a computer program produces a shortened version of a selected body of text, sounds like a dream come true for every time-crunched high school student assigned Tolstoy’s War and Peace. This may not be feasible now, but automated summarization strategies are already widespread in our culture.
- A student’s grade or score on a test is meant to summarize the student’s understanding of the course material or of the individual concepts measured on the test.
- A credit score is a numeric value associated with a consumer’s credit risk. Missed payments, excessive credit checks, and the amount of outstanding debt all contribute to a final numeric FICO score in the range of 300 to 850.
Automated summaries like these can be both usable and useful, but summarization of data comes at a cost: it is lossy. Summarization reduces complexity by removing redundant or otherwise less significant details; effectively, this is a form of dimension reduction.
However, those details cannot be recovered from the summary alone. The process maps the complexity of a large data set onto a simpler, smaller one. To illustrate, it is impossible to determine which test items a student missed by merely knowing the total score, or to understand the finer points of this Big Data unit by examining the Wordle above.
Three Methods for Text Summarization
Let’s explore three different ways to summarize text algorithmically. Note that we are trying to create a process that takes a text and generates a short summary of what the text is about. However, at no point do we actually take into consideration the meaning of any of the words. How is this possible?
When text is written, it has an inherent structure. Sentences are built according to certain rules, such as SUBJECT precedes VERB precedes OBJECT. Of course, other languages, and other creatures, like Yoda, may have different rules.
Three common techniques for automated text summarization are outlined below: highest word frequency, TF*IDF, and topic sentence concatenation.
The Highest Word Frequency method
This is the most intuitive of the algorithms, and the one that is used by Wordle:
- Parse the text and keep a separate count of each distinct word.
- Sort the list so that the most frequent words are first.
- Remove words from a stop word list.
- Use the top X (such as 10–20) most frequent words as the summary.
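The steps above can be sketched in a few lines of Python. The function name `top_words` and the tiny stop word list are illustrative choices, not part of any standard library; real stop word lists run to hundreds of entries.

```python
import re
from collections import Counter

# A small illustrative stop word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "that", "are"}

def top_words(text, x=10):
    """Return the x most frequent non-stop words in text."""
    words = re.findall(r"[a-z']+", text.lower())            # parse the text into words
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(x)]      # most frequent first

summary = top_words("Big data is data so big that big tools are needed to process the data.", x=2)
# "big" and "data" dominate the counts
```

Note that without the stop word filter, words like "the" would crowd out every meaningful term, which is exactly the problem the stop word list exists to solve.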
Think about the following: What is the point of the stop word list? How could this summarization method be exploited by spammers?
The TF*IDF method
The TF*IDF method extends the previous one by adding an assumption that only uncommon words are useful in a summary:
The words that appear in all documents are not useful in differentiating among them, so you begin by finding the most common words and eliminating them. This is like using a stop word list, except that you build the list dynamically as you go rather than depending on a predetermined master list.
- Calculate the TF (term frequency) for each word. Basically, this means taking each of the word counts you generated in the previous method and dividing it by the total number of words in the text: TF(word) = (count of word) / (total words in the text).
- Calculate the IDF (inverse document frequency) for each word. This means figuring out how many of the documents you are processing contain the word. As an example, you might leverage Google to get these counts: search for the term in quotation marks, and Google will report the number of documents it retrieved containing exactly that term. Then divide the total number of English documents Google has indexed by the number of documents containing the term to find the IDF.
- Of course, we don’t know the total number of documents that Google has indexed, so let’s cheat. Assume that every document contains the word “the,” and use the document count for “the” as the total document count. A Google search returned 10,690,000,000 documents when this page was created (March 29, 2014).
- Calculate the TF*IDF for each word using this formula: TF*IDF(word) = TF(word) × IDF(word).
- Sort the list so that the highest TF*IDF words are first.
- Use the top X (such as 10–20) highest ranked TF*IDF words as a summary.
Think about the following: How might this compare with the highest word frequency method? What’s the major difference?
The Topic Sentence Concatenation method
This method assumes that the main idea of each paragraph appears in its first sentence, so concatenating the topic sentences of all the paragraphs can encapsulate the meaning of the whole document. For this method, beginning at the top of the document:
- Scan the first sentence and add it to your summary.
- Skip remaining text until you reach either the next paragraph or the end of the document.
- If you reach the next paragraph, repeat steps 1 and 2.
- If you reach the end, your summary is complete.
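The loop above can be sketched as follows. This version assumes paragraphs are separated by blank lines and that a sentence ends at the first period, exclamation point, or question mark; the function name `topic_sentence_summary` is our own.

```python
import re

def topic_sentence_summary(text):
    """Build a summary from the first sentence of each paragraph."""
    sentences = []
    for paragraph in text.split("\n\n"):      # assume blank lines separate paragraphs
        paragraph = paragraph.strip()
        if not paragraph:
            continue                          # skip empty paragraphs
        # take everything up to the first sentence-ending punctuation mark
        match = re.match(r".*?[.!?]", paragraph, re.DOTALL)
        sentences.append(match.group(0) if match else paragraph)
    return " ".join(sentences)
```

Unlike the two word-ranking methods, this one produces grammatical sentences, though its length is controlled by the number of paragraphs rather than by a word limit X.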
Think about the following: How is this method different from the others? How does it fare if you limit the word count to X as you did with the other methods?
Instructions: Utilize Automated Summarization for your TEDxKinda Presentation
Your challenge is to apply one of the automated summarization strategies (i.e., word cloud, Highest Word Frequency, TF*IDF, or Topic Sentence Concatenation) to a text (or combine multiple texts for even more data), considering which summarization strategy might be the most useful. You may also use the automated summarization tools from Tools for Big Data Analysis, which may be more challenging but also much more fruitful for your research. Be sure to examine texts related to your TEDxKinda topic so that you can apply your work to your final presentation.
Submit a document (e.g., .doc or .pdf) that includes the following items:
- A proper heading (including names, date, assignment, and title),
- The automated summarization,
- An explanation detailing why you chose the method you did.