Text Corpus for Final Paper Assignment

Due Date: Monday, October 29, 2018

Submission Instructions: Submit digitally (on Box) and be prepared to discuss in our next class

Assignment Description:

For this assignment, you will assemble and/or curate a collection of text files and metadata to use for your final paper. In digital humanities, practitioners of large-scale or computational text analysis refer to an object of analysis as either a corpus (plural: corpora) or a dataset, but the idea is basically the same either way. In order to write code to analyze a text, you need a collection of machine-readable files and, often, some additional information that helps you do things like train a machine learning model or compare sub-groups of texts. There are various ways to assemble these kinds of collections, but this assignment is designed to walk you through a very common and easy-to-learn approach. The method I'm talking about doesn't have a specific name that I know of, but it's how almost everyone who does this kind of analysis handles their files and folders. This method is also meant to work in tandem with the research design principles described in Ignatow and Mihalcea, "Research Design and Basic Tools," from An Introduction to Text Mining: Research Design, Data Collection, and Analysis.

Mapping Research Design Practices to This Assignment

Building on Ignatow and Mihalcea, some common steps in DH research design might include:

  1. Choosing a topic
  2. Exploring a scholarly field and an evidentiary archive
  3. Generating a research question or problem
  4. Planning a qualitative or quantitative way of addressing the question
  5. Operationalizing your method computationally (with code or software)
  6. Locating or creating a corpus
  7. Executing the computation on the corpus
  8. Evaluating and interpreting results

Most likely, you did steps 1 through 3 when you wrote your research proposal. Some of you began the crucial process of thinking about a qualitative or quantitative way of addressing your question, as well as how you might operationalize your method computationally. The focus of this assignment, of course, is to create your corpus, but you may need to do some more thinking about steps 4 and 5 before you know exactly what kind of corpus you need.

What to Collect and How

The main idea of this step is to connect the texts you collect as directly as possible to your research question. At the same time, what you collect and what form it takes must be appropriate for the kind of measure or analysis you want to do. If I want to study the history of 19th-century book reviews (as I am doing in my own work), it doesn't make sense to collect Goodreads reviews, unless I'm planning to write a methods or proof-of-concept paper before I collect reviews from my time period. If I want to study Native American crime fiction, I need texts that belong to that subcategory and, if my question focuses on if/how Native American crime fiction differs from "mainstream" white American crime fiction, I must also collect a corpus of white American crime fiction for comparison.

All this said, some topics have to be approached in pieces or from the side. Coulson, Buchanan, and Aubeeluck, for example, used online support groups for Huntington's disease as a case study in the types of support online communities offer group members. Huntington's was chosen in part because, "Due to the nature of the symptoms, the genetic element of the disease, and the fact that there is no cure, patients with Huntington's disease and those in their support network often experience considerable stress and anxiety" (Ignatow and Mihalcea 147). Coulson, Aubeeluck, and Buchanan have gone on to publish additional pieces on online support groups, Huntington's, and other aspects of patient psychology.

Remember to consider what's appropriate, as well as what's practical. In some cases, you may even opt to change your research question to fit the data you can find or digitize within your time constraints. The most important thing is to develop a research plan with a solid chance of generating some kind of insight that is meaningful to you and your scholarly community.

Other Factors to Consider

  1. Inclusion and exclusion criteria (ideally concrete and not purely subjective, such as "texts I think are canonical"; think historically)
  2. Corpus size (especially if using a quantitative method with a scale requirement)
  3. Representativeness of the corpus (who is represented, and what or who might be left out)
  4. Sampling strategies (especially if sampling from a "big data" corpus)
  5. Type of text data needed
    • Most often you will find text data in one of three formats: term-frequency tables, unprocessed full text, or "tagged" text such as TEI documents (see the short sketch after this list). I will say more in class about what kind of data you need for various computations, or you can meet with me individually if you have a specific question about a corpus.
  6. Required metadata and/or labeling (this can be very time intensive!)
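
To make the distinction between formats concrete, here is a minimal sketch of turning unprocessed full text into a term-frequency table. The sample sentence and the deliberately naive tokenizer are illustrative assumptions, not requirements of the assignment:

# A minimal sketch: unprocessed full text in, term-frequency table out.
# The sample string is a placeholder; a real script would read from txts/.
from collections import Counter
import re

full_text = "Call me Ishmael. Some years ago, never mind how long precisely, I went to sea."

# Naive tokenization: lowercase alphabetic runs only. Real projects usually
# need a more careful tokenizer (punctuation, hyphens, named entities, etc.).
tokens = re.findall(r"[a-z]+", full_text.lower())

# The term-frequency "table": word -> count. Some sources (e.g., HathiTrust's
# Extracted Features) provide only this kind of data, never the full text.
term_frequencies = Counter(tokens)
print(term_frequencies.most_common(5))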

Some Strategies

  1. Upcode or subset an existing dataset/corpus
  2. Assemble from a large archive of texts such as HathiTrust, the Internet Archive/Open Library, or Project Gutenberg (see the download sketch after this list)
  3. Extract full text from public domain or purchased eBooks (do not violate copyright!)
  4. Scrape or download from a primary source such as Twitter (using Twitter API)
  5. If no other option: scan and OCR books yourself
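
As an example of strategy 2, the sketch below downloads a single public-domain text from Project Gutenberg into the txts folder. The book ID, URL pattern, and output filename are illustrative assumptions; confirm the actual file URL on gutenberg.org and check the site's terms before downloading in bulk:

# A hedged sketch of strategy 2: fetching one public-domain text from
# Project Gutenberg. Assumes the requests library (add it to requirements.txt)
# and an existing txts/ folder; the ID and URL pattern are illustrative.
import requests

book_id = 1342  # hypothetical example ID
url = f"https://www.gutenberg.org/files/{book_id}/{book_id}-0.txt"

response = requests.get(url, timeout=30)
response.raise_for_status()  # stop here if the download failed

# Save under the same filename your metadata.csv will point to.
with open(f"txts/novel_{book_id}.txt", "w", encoding="utf-8") as f:
    f.write(response.text)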

Collection Structure

The following tree of folders and files would all be located in one GitHub repository, which would represent the base folder or starting point:

Files and Folders: Ignore Most of This For Now

|-- .gitignore
|-- application
    |-- some-functions.py
    |-- other-functions.py 
|-- lexicon
    |-- lexical-data.csv
|-- main.ipynb
|-- meta
    |-- metadata.csv
    |-- other-meta-if-needed.csv 
|-- README.md
|-- requirements.txt
|-- tables-and-figures
|-- txts
    |-- file1.txt
    |-- file2.txt
    |-- etc.txt
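
If you'd rather scaffold this skeleton programmatically than create it by hand, a few lines of Python will do it. This is purely a convenience sketch; the names simply mirror the tree above:

# Scaffold the repository layout shown in the tree above.
from pathlib import Path

base = Path(".")  # run this from the root of your GitHub repository
for folder in ["application", "lexicon", "meta", "tables-and-figures", "txts"]:
    (base / folder).mkdir(exist_ok=True)

# Create empty top-level files so the skeleton is complete.
for filename in [".gitignore", "README.md", "requirements.txt"]:
    (base / filename).touch()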

Essentials: Don't Ignore This

From within this structure, the most important locations for the corpus assignment are:

|-- meta
    |-- metadata.csv
|-- README.md
|-- txts
    |-- file1.txt
    |-- file2.txt
    |-- etc.txt   

If you get these three things right, everything else can come later (a few of the files and folders listed above are essentially optional, depending on what kind of analysis you're doing). However, it's crucial that you set up the metadata, README, and txts sections at this stage, because doing so will save you a great deal of time and hassle later on in the process.

Create metadata.csv File (inside a folder called meta)

In general, this spreadsheet is where you would put author, title, date, genre, and other data about your texts. It's also where you would store training labels for later use. You can include fields for almost any kind of data point, but you should follow these general guidelines for the spreadsheet:

  • Each "Thing" You're Studying Gets its Own Row (could be novels, short stories, songs, or chapters)
  • Each Data Point Gets its Own Column (author_lastname, gender, etc)
  • Make column names simple and avoid spaces
  • One column in your spreadsheet should have a unique item ID called 'id' (even if it's just numbers 1-30)
  • One column should "point" to a corresponding text file. (For example, the first row has an id of 1 and a "txt_file" value of novel_1.txt; meanwhile, the file named novel_1.txt in the txts folder contains the full text of the same novel represented in that row.)
  • Occasionally, you will want two columns to represent the same data in more than one way. For example, you might have a "gender" column with values like "m", "f", "multi", "unknown", "trans", or "nonconforming", but you might also have a column called "is_female" that's just 0 or 1. (In programming, 1 means yes and 0 means no.)
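
Here is a sketch of what the first rows of such a spreadsheet might look like, and how the txt_file column connects it to the corpus in Python. Every column name, title, and file path here is a hypothetical example, and pandas is an assumed dependency (it would go in requirements.txt):

# Hypothetical first rows of meta/metadata.csv, following the guidelines above:
#
# id,txt_file,author_lastname,title,year,gender,is_female
# 1,novel_1.txt,Hopkins,Contending Forces,1900,f,1
# 2,novel_2.txt,Chesnutt,The Marrow of Tradition,1901,m,0

import pandas as pd

meta = pd.read_csv("meta/metadata.csv")

# The txt_file column "points" at the corpus: use it to open each full text.
for row in meta.itertuples():
    with open(f"txts/{row.txt_file}", encoding="utf-8") as f:
        text = f.read()
    print(row.id, row.title, len(text))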
Use README.md to Explain Your Choices and What Everything Means (a.k.a. Your "Codebook")

In this file, you will record and explain to others how you collected your texts and assembled your metadata, including what kinds of choices you made and what any abbreviations used elsewhere mean. For example:

  • Did you build on top of, or draw your files from, someone else's dataset or corpus?
  • How did you decide on your metadata columns?
  • What categories did you use? Did you use someone else's standardized fields (e.g., Library of Congress Classification categories)?
  • How did you categorize edge cases or complex examples? (e.g., multigenre work in an analysis of genres)
  • What were your inclusion criteria for texts in this collection? (e.g., all Pulitzer Prize winners, 1922-1942)
  • What were your exclusion criteria? (e.g., any novel shorter than N words, any book of short stories, any multi-author work)
txts Folder and txt Files

I've already mentioned the most crucial aspect of this folder, but it bears repeating: each file in the txts folder should have a filename that's recorded in your metadata.csv file. Each row should point to exactly one file, and each file should belong to exactly one row. If you're working with a file format where you only want part of the file (such as the "body" field of an XML document), you should make note of this in your README file.
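
A quick way to enforce this one-row-one-file rule is to cross-check the metadata against the folder. The sketch below assumes the meta/metadata.csv and txts/ layout described above, with a txt_file column as in the earlier example:

# Cross-check metadata.csv against the txts/ folder (pandas assumed).
from pathlib import Path

import pandas as pd

meta = pd.read_csv("meta/metadata.csv")
claimed = set(meta["txt_file"])
on_disk = {p.name for p in Path("txts").glob("*.txt")}

print("Rows without a file:", sorted(claimed - on_disk))
print("Files without a row:", sorted(on_disk - claimed))

# Duplicates mean two rows share one file, which violates one-row-one-file.
duplicates = meta["txt_file"][meta["txt_file"].duplicated()]
print("Rows sharing a file:", sorted(set(duplicates)))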

Works Cited

Coulson, Neil S., et al. "Social Support in Cyberspace: A Content Analysis of Communication within a Huntington's Disease Online Support Group." Patient Education and Counseling, vol. 68, no. 2, Oct. 2007, pp. 173–178. doi:10.1016/j.pec.2007.06.002.

Ignatow, Gabe, and Rada Mihalcea. An Introduction to Text Mining: Research Design, Data Collection, and Analysis. SAGE Publications, 2017.