November 19, 2019

Plan for today

  • Text formats and encoding
  • “Text as data” - why use text?
  • NLP workflow
  • Regular expressions
  • “Ask me anything”

Text formats

Revisited: Basic units of data

  • Bits
    • Smallest unit of storage; a 0 or 1
    • With n bits, can store \(2^n\) patterns
  • Bytes
    • 8 bits = 1 byte (which is why 1 byte can store \(2^8 = 256\) patterns)
    • “Eight-bit encoding”: one byte is used to represent one character, such as A represented as 01000001
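
A minimal Python sketch of the arithmetic above, assuming the letter A as the example character (its ASCII code, 65, matches the bit pattern 01000001). The variable names are illustrative only.

```python
# Number of distinct patterns that n bits can store: 2 ** n
n_bits = 8
patterns = 2 ** n_bits  # 256 for one byte

# Assumed example character: "A" has code 65 in ASCII
letter = "A"
code = ord(letter)

# Show the code as an 8-bit binary string, padded with leading zeros
bits = format(code, "08b")

# Going the other way: read the bit string as a number, then as a character
decoded = chr(int(bits, 2))

print(patterns, code, bits, decoded)
```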

ASCII

Encoding

Solution: Unicode

  • Unicode was developed to provide a unique number (a “code point”) to every known character, and even to some that are “unknown”
  • problem: there are far more code points than fit into 8-bit encodings. Hence there are multiple ways to encode the Unicode code points
  • variable-width encodings use as many bytes as each character needs. The advantage is efficiency: ASCII characters take just one byte in UTF-8, and the first 256 code points were set in the Unicode standard to their ASCII and ISO-8859-1 equivalents
  • the two most common are UTF-8 and UTF-16, built on 8-bit and 16-bit code units respectively (UTF-8 uses 1 to 4 bytes per character, UTF-16 uses 2 or 4)
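
To make the variable-width idea concrete, here is a short Python sketch that counts how many bytes a few characters take in UTF-8 and UTF-16. The helper name `encoded_lengths` and the sample characters are illustrative choices. The codec `utf-16-le` is used so that the byte-order mark is not counted.

```python
def encoded_lengths(text):
    """Return the number of bytes `text` takes in UTF-8 and UTF-16."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # "utf-16-le" fixes the byte order, so no byte-order mark is added
        "utf-16": len(text.encode("utf-16-le")),
    }

# Sample characters: ASCII, Latin-1 range, a currency symbol, an emoji
samples = ["A", "é", "€", "😀"]

for ch in samples:
    sizes = encoded_lengths(ch)
    print(f"U+{ord(ch):04X}  utf-8: {sizes['utf-8']} bytes  utf-16: {sizes['utf-16']} bytes")
```

Under these assumptions, ASCII text is smallest in UTF-8, while characters outside the Basic Multilingual Plane (such as the emoji) take four bytes in both encodings.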

Things to watch out for

  • Input texts can be very different
  • Much text production software (e.g. MS Office-based products) still tends to use proprietary encodings, such as Windows-1252
  • Windows tends to use UTF-16, while Mac and other Unix-based platforms use UTF-8
  • Your eyes can deceive you: a client may display gibberish even though the text is encoded as intended
  • No easy method of detecting encodings (except in HTML meta-data)
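
The points above can be sketched in Python. The example assumes the word “café” as input and uses Python's built-in `utf-8` and `cp1252` (Windows-1252) codecs. The helper `try_decodings` is a hypothetical illustration, not a real detection method: it only shows that the same bytes can decode "successfully" under more than one encoding.

```python
def try_decodings(raw, candidates=("utf-8", "cp1252")):
    """Decode `raw` bytes with each candidate encoding.

    Returns a dict mapping encoding name to the decoded string,
    or to None when the bytes are invalid in that encoding.
    """
    results = {}
    for enc in candidates:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None
    return results

# Assumed example text, saved as UTF-8 bytes
utf8_bytes = "café".encode("utf-8")

# Reading those bytes as Windows-1252 gives gibberish, but raises no error
garbled = utf8_bytes.decode("cp1252")

# The bytes themselves were fine: reversing the wrong step recovers the text
repaired = garbled.encode("cp1252").decode("utf-8")

print(garbled, repaired)
print(try_decodings(utf8_bytes))
print(try_decodings("café".encode("cp1252")))
```

Because a wrong decoding often succeeds without an error, a clean decode is not proof that the encoding was guessed correctly.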

Document formats

  • Many different formats contain text
  • How many can you think of?
  • What problems are encountered?

Why use text?

Measuring unobservables

  • psychological states
  • sentiment
  • “topics”
  • ideology: “left-right” policy positions
  • corruption
  • cultural values
  • power

Example: budget debate