Creating Structure from Unstructured Data Sets

From Unstructured to Structured

Think about the following scenario and what data are involved:

A gas station has installed cameras and sensors to track the usage of gasoline and make predictions to maintain stock of the right fuels and achieve the best pricing strategies.

Which of the following items are data in this scenario?

Is it data? Answer
Amount of gas pumped Yes / No
Type of gas pumped Yes / No
Price per gallon of gas pumped Yes / No
Date Yes / No
Time of day Yes / No
License plate number Yes / No
Make/model of car Yes / No
Color of car Yes / No

That was a trick question. The correct answer to all of these questions is “Yes.” All of these are examples of data.

Data are pieces of information that are observable and/or measurable. In this scenario, even the color of a car is a characteristic that can be observed, measured, and analyzed—regardless of whether it seems useful. However, the raw data (camera footage) is unstructured. By creating structured data sets, we can make data more usable and useful.
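To make the idea of a structured data set concrete, here is a minimal sketch of how one fueling event from the scenario might be recorded. The `PumpEvent` class, its field names, and the sample values are illustrative assumptions, not part of any real gas station system.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class PumpEvent:
    """One fueling event; the field names are illustrative assumptions."""
    timestamp: datetime      # date and time of day
    fuel_type: str           # e.g. "regular" or "diesel"
    gallons: float           # amount of gas pumped
    price_per_gallon: float  # price in dollars
    license_plate: str
    make_model: str
    color: str


# Made-up sample values, for illustration only.
events = [
    PumpEvent(datetime(2024, 1, 5, 8, 15), "regular", 10.0, 3.50, "ABC1234", "Honda Civic", "blue"),
    PumpEvent(datetime(2024, 1, 5, 8, 40), "diesel", 20.0, 4.00, "XYZ9876", "Ford F-250", "white"),
    PumpEvent(datetime(2024, 1, 5, 17, 5), "regular", 5.0, 3.50, "JKL4321", "Toyota Corolla", "red"),
]


def gallons_by_fuel_type(records):
    """Sum the gallons pumped for each fuel type."""
    totals = defaultdict(float)
    for record in records:
        totals[record.fuel_type] += record.gallons
    return dict(totals)


totals = gallons_by_fuel_type(events)
print(totals)
```

Once every event has the same named fields, a question such as "how much of each fuel was pumped?" becomes a short loop rather than a review of camera footage.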

Unstructured vs. Structured Data

Raw, unstructured data can be overwhelming to use when the volume is too large, or they may simply lack the organization that would make them usable and useful. To make unstructured data more usable and useful, we apply structure and organization to them, usually after collection. However, applying structure after collection alters the data, because it involves selecting and organizing only the pertinent details. Details not captured in the resulting organized set may be lost.

Unstructured data contain everything collected in “raw” form, but connections and relationships among strands of data are both harder to trace and much slower to process than they are in structured data sets. On the other hand, structured data are easy to access and organize, but may lack the big picture and the details that unstructured data may possess.

Let’s consider the following analogy of turning logs into lumber and lumber into a barn:

These logs are like raw, unstructured data. In their present state they are not very usable or useful. However, the trees themselves would be even more raw, with less structure. By cutting them down, we lost some wood in the form of branches, roots, and leaves.

To make these logs more useful and usable, we apply some structure by converting them into lumber. This lumber is much more usable and useful. For example, one could use it to build a table, a doghouse, a soapbox derby car, or a barn. However, by turning the logs into lumber, we lost some wood in the form of wood chips, tree bark, and sawdust.

To make the lumber more useful and usable, one could build a barn structure. To do this, one has to saw the lumber into different, more useful and usable shapes. Again, this process cannot be reversed, and we lose some more wood in the form of wood chips and sawdust.

Every time we apply structure to an unstructured data set, it becomes more difficult (sometimes impossible) to gain back some of the unstructured data. Think about it this way:

  • By turning trees into logs, we lose some “data” (branches, roots, and leaves).
  • By turning logs into lumber, we lose some more “data” (wood chips, bark, and sawdust).
  • By turning lumber into a barn, we lose even more “data” (wood chips and sawdust).

However, by adding structure, we also make the data more usable and useful. Hence, one has to weigh the usability and usefulness of structured data against the data loss that occurs from structuring unstructured data sets.
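The trade-off described above can be sketched in a few lines of code. The example below is a minimal illustration, assuming a hypothetical free-text note and a schema of only three fields chosen in advance; the note, the pattern, and the `structure` function are all invented for this sketch.

```python
import re

# Hypothetical raw observation, as someone annotating camera footage might write it.
raw_note = (
    "Blue Honda Civic, plate ABC1234, pumped 10.5 gal of regular at $3.49/gal; "
    "driver cleaned the windshield and left without entering the store."
)

# The structure we choose to impose: only three fields.
PATTERN = re.compile(
    r"pumped (?P<gallons>\d+(?:\.\d+)?) gal of (?P<fuel_type>\w+) "
    r"at \$(?P<price>\d+(?:\.\d+)?)/gal"
)


def structure(note):
    """Extract the chosen fields from a raw note; everything else is dropped."""
    match = PATTERN.search(note)
    if match is None:
        raise ValueError("note does not contain a recognizable fueling record")
    return {
        "gallons": float(match.group("gallons")),
        "fuel_type": match.group("fuel_type"),
        "price_per_gallon": float(match.group("price")),
    }


record = structure(raw_note)
print(record)

# Details outside the schema cannot be recovered from the record alone.
lost_details = ["Blue", "Honda Civic", "ABC1234", "windshield"]
print([detail for detail in lost_details if detail not in str(record)])
```

The structured record is easy to total, sort, and compare, but the car's color, the license plate, and the driver's behavior are gone unless the raw note is also kept.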

Common misconception: “Unstructured data” have no structure.
  • At some level, there is structure to even “unstructured data.” For example, the binary representation and the particular encoding used for text, video, etc. are forms of structure. “Structure” in the sense of “structured data” means that some level of organization according to their intended use has been applied to the data before they are stored. With unstructured data, data are stored as collected with any necessary structure applied later during analysis.

Utility of Unstructured and Structured Data

Processing unstructured data facilitates knowledge discovery, because emergent groupings (i.e., structure) in data may depend on characteristics that are unknown before the data are collected. When a structure is imposed on data before they are processed, such unforeseen characteristics may be obscured, and unidentified relationships may be harder to detect than they would be in raw, unstructured data.

Structured data depend on known characteristics. For example, data structures could be based on certain prescribed characteristics, such as Employee Name, Employee ID, etc. These structures allow data to be organized into schemas that may be more usable and useful. For instance, applying structure to data sets often makes data retrieval quicker than it is with unstructured data: rather than searching through all possible data and selecting the appropriate information (sequential/serial access), structured data can exploit predefined methods to access data directly (indexed/random access). Therefore, structured data are usually processed more efficiently than unstructured data. To summarize briefly:

  • Unstructured data:
    • Everything collected in “raw” form, but connections and relationships among strands of data are both harder to trace and much slower to process
    • Typically require much more storage space than structured data
  • Structured data:
    • Data that are identifiable because they are organized in a structure
    • Typically require less storage space than unstructured data
    • Organized, which supports faster retrieval
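The difference between sequential access and indexed access mentioned above can be sketched as follows. This is a minimal illustration using a Python dictionary as the index; the employee records and function names are hypothetical.

```python
# Hypothetical employee records, stored in no particular order.
employees = [
    {"employee_id": 1003, "name": "Ada"},
    {"employee_id": 1001, "name": "Grace"},
    {"employee_id": 1002, "name": "Alan"},
]


def find_sequential(records, employee_id):
    """Sequential access: inspect records one by one until a match is found."""
    for record in records:
        if record["employee_id"] == employee_id:
            return record
    return None


# Indexed access: build a lookup table keyed by Employee ID once...
index_by_id = {record["employee_id"]: record for record in employees}


def find_indexed(index, employee_id):
    """...then jump straight to the record without scanning the others."""
    return index.get(employee_id)


print(find_sequential(employees, 1002))
print(find_indexed(index_by_id, 1002))
```

Both calls return the same record. The sequential search may have to look at every record, while the indexed lookup goes directly to the entry for the requested ID. The index only helps for the characteristic it was built on (here, Employee ID), which is why structured data depend on knowing the characteristics of interest in advance.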

Let’s examine the data contained in a Walmart receipt, in order to outline the differences between processing unstructured data and structured data.

Unstructured data processing:

With unstructured data processing, receipt data might be stored in the “natural” format in which they arrive (i.e., as a massive set of receipts). Whether these are electronic versions or scanned copies of paper receipts, the only structure assigned to them would be whatever is implicit in the collection (e.g., transactions might be naturally grouped together because they occurred at the same time or were generated by the same individual). Since the data are to remain unstructured, no further ordering would be imposed through “post-processing.”

Structured data processing:

Structured data processing imposes some organization on the data contained in the receipted transactions (e.g., all transactions could be tracked by customer, across all trips to any Walmart store). This would facilitate access, search, and reasoning about a particular patron’s spending habits. Alternatively, all data might be organized by date and by store, which may allow the productivity of stores to be assessed and compared easily.
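As a rough sketch of what this organization buys us, suppose the receipts have already been converted into records with a few named fields. The field names, customer IDs, store names, and amounts below are invented for illustration and do not reflect any real retailer's data.

```python
from collections import defaultdict

# Hypothetical transactions already extracted from receipts.
transactions = [
    {"customer": "C1", "store": "Store 12", "date": "2024-03-01", "total": 25.00},
    {"customer": "C2", "store": "Store 12", "date": "2024-03-01", "total": 10.50},
    {"customer": "C1", "store": "Store 40", "date": "2024-03-08", "total": 14.50},
]


def total_by(records, key):
    """Sum transaction totals, grouped by the given field."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key]] += record["total"]
    return dict(totals)


spend_per_customer = total_by(transactions, "customer")
sales_per_store = total_by(transactions, "store")

print(spend_per_customer)
print(sales_per_store)
```

The same records answer both the customer question and the store question, but only because those fields were chosen when the structure was imposed. Anything on the receipt that was not captured as a field cannot be grouped or compared this way.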

As previously mentioned, though structured data sets allow for easier organization and analysis of certain phenomena, they may make it more difficult to discover unknown relationships because of the structure itself. Remember: when analyzing data sets, one must always ask:

What data or relationships may be lost by imposing structure on the unstructured data set?

Screen Scraping

Sometimes we want to analyze data that is already formatted for human use. This data might be a text (e.g., a book), an image (like the one below), a complete website (like this one you’re reading), a formatted table/graph, etc. One method for extracting information from these data is called screen scraping.

Screen scraping (or data scraping) is the conversion of data formatted for human use to a format more easily used by automated computer processes. In particular, the conversion is usually oriented toward taking output produced for humans (such as display on a screen) and producing input for a computer to process (such as a file or the clipboard).

  • For example, take a screenshot:
    • Windows: PrtScn (Print Screen)
    • Mac OS X: Cmd + Shift + 3

Taking a screenshot converts what is seen on the computer display (typically, the entire desktop) to a machine-readable image format, as in the example below. The file BtB-screenshot.png can now be processed like any other image file (e.g., cropped, rotated, etc.), as can be seen in the resulting BtB-description.png. In some use cases, an image file isn’t appropriate for the task and we want to extract the on-screen information as plaintext. We could then use Optical Character Recognition (OCR) software to convert the visual contents of BtB-description.png into another format, such as ASCII text. Now, we can manipulate the words and letters that we captured from the screen just like any other piece of text (e.g., copy/paste, translate, etc.).

Filename Result
BtB-screenshot.png
BtB-description.png
BtB-description.txt Every day, billions of photographs, news stories, songs, X-rays, TV shows, phone calls, and emails are being scattered around the world as sequences of zeroes and ones: bits. We can’t escape this explosion of digital information and few of us want to—the benefits are too seductive. The technology has enabled unprecedented innovation, collaboration, entertainment, and democratic participation.
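Scraping is not limited to screenshots and OCR. A closely related technique, often called web scraping, extracts data from a web page that was laid out for human readers. The sketch below is a minimal illustration using only Python's standard `html.parser` module; the page fragment, the fuel prices, and the `TableScraper` class are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical page fragment formatted for human readers.
PAGE = """
<table>
  <tr><th>Fuel</th><th>Price per gallon</th></tr>
  <tr><td>Regular</td><td>$3.49</td></tr>
  <tr><td>Diesel</td><td>$4.05</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collect the text of each table cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # start a new row
        elif tag in ("td", "th"):
            self._in_cell = True
            self.rows[-1].append("")      # start a new cell in the current row

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


scraper = TableScraper()
scraper.feed(PAGE)

# The first row holds the column headings; the rest hold the data.
header, *body = scraper.rows
prices = {fuel: float(price.lstrip("$")) for fuel, price in body}
print(prices)
```

The result is a dictionary that a program can sort, compare, or store, whereas the original table was only convenient for a person to read.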

This video is an advertisement for a screen scraping software, but it provides a good overview of how screen scraping can be useful:

https://www.youtube.com/embed/s0erYF8MsyM

What other purposes of screen scraping can you think of?