November 19, 2019
Plan for today
- Text formats and encoding
- “Text as data” - why use text?
- NLP workflow
- Regular expressions
- “Ask me anything”
Revisited: Basic units of data
- Bit: smallest unit of storage; a 0 or 1
- With n bits, can store \(2^n\) patterns
- 8 bits = 1 byte (which is why 1 byte can store \(2^8 = 256\) patterns)
- “eight-bit encoding” - used to represent characters, such as A, represented as 65 = 01000001
- Unicode was developed to provide a unique number (a “code point”) to every known character – even some that are “unknown”
- Problem: there are far more code points than fit into 8-bit encodings. Hence there are multiple ways to encode the Unicode code points
- variable-byte encodings use multiple bytes as needed. The advantage is efficiency: ASCII characters need just one byte in UTF-8, and the first 256 Unicode code points were set to their ASCII and ISO-8859-1 equivalents
- the two most common are UTF-8 and UTF-16, which use 8-bit and 16-bit code units respectively (a single character may need more than one unit)
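The points above can be checked directly in Python. This is a minimal sketch, not part of the original slides: the helper name `describe` and the sample characters are chosen only for illustration. It shows the code point of a character, its bit pattern, and how many bytes the same character takes in UTF-8 and UTF-16.

```python
# With n bits we can store 2**n patterns, so one byte (8 bits) stores 256.
patterns_in_one_byte = 2 ** 8


def describe(ch: str) -> dict:
    """Return the code point, bit pattern, and encoded sizes of one character.

    `describe` is a hypothetical helper written for this example.
    """
    code_point = ord(ch)  # the Unicode code point as an integer
    return {
        "char": ch,
        "code_point": code_point,
        # Binary form, zero-padded to at least 8 digits
        "binary": format(code_point, "08b"),
        # Number of bytes needed in each encoding
        "utf8_bytes": len(ch.encode("utf-8")),
        # "utf-16-le" avoids counting the 2-byte byte-order mark
        "utf16_bytes": len(ch.encode("utf-16-le")),
    }


samples = [
    "A",           # U+0041, plain ASCII
    "\u00e9",      # U+00E9, e with acute accent
    "\u20ac",      # U+20AC, euro sign
    "\U0001F600",  # U+1F600, an emoji outside the 16-bit range
]

print(patterns_in_one_byte)
for ch in samples:
    print(describe(ch))
```

Running this shows that "A" has code point 65 and fits in one UTF-8 byte, while the other sample characters need two, three, or four UTF-8 bytes. UTF-16 uses two bytes for most characters and four for code points beyond the 16-bit range.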
Things to watch out for
- Input texts can be very different
- Much text production software (e.g. MS Office-based products) still tends to use proprietary encodings, such as Windows-1252
- Windows tends to use UTF-16, while Mac and other Unix-based platforms use UTF-8
- Your eyes can deceive you: a client may display gibberish even though the text is encoded as intended
- No easy method of detecting encodings (except in HTML meta-data)
- Many different formats contain text
- How many can you think of?
- What problems are encountered?
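The encoding pitfalls listed above can be reproduced in a few lines of Python. This is a sketch under stated assumptions: the sample string and the helper `decodes_as` are made up for illustration, and `cp1252` is Python's codec name for Windows-1252. It shows that decoding with the wrong encoding can silently produce gibberish while the underlying bytes stay intact, and why a strict decode is only a weak test of the true encoding.

```python
text = "na\u00efve caf\u00e9"  # a sample string with two non-ASCII letters

# Store the text as UTF-8 bytes
utf8_bytes = text.encode("utf-8")

# Reading those bytes as Windows-1252 raises no error; it just looks wrong
garbled = utf8_bytes.decode("cp1252")

# The bytes were never damaged, so undoing the wrong step recovers the text
recovered = garbled.encode("cp1252").decode("utf-8")


def decodes_as(raw: bytes, encoding: str) -> bool:
    """Return True if `raw` decodes without error under `encoding`.

    Hypothetical helper: success only means the bytes are *valid* in that
    encoding, not that it is the encoding the author actually used.
    """
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


print(garbled)
print(recovered == text)
# Windows-1252 bytes are usually invalid UTF-8, which is a useful hint...
print(decodes_as(text.encode("cp1252"), "utf-8"))
# ...but UTF-8 bytes often decode "successfully" as Windows-1252
print(decodes_as(utf8_bytes, "cp1252"))
```

The asymmetry in the last two lines is one reason detection is hard: a failed strict decode can rule an encoding out, but a successful one does not confirm it.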
Why use text?
- psychological states
- ideology: “left-right” policy positions
- cultural values
Example: budget debate