Due Date: Monday, October 29, 2018
Submission Instructions: Submit digitally (on Box) and be prepared to discuss in our next class
For this assignment, you will assemble and/or curate a collection of text files and metadata to use for your final paper. In digital humanities, practitioners of large-scale or computational text analysis refer to an object of analysis as either a corpus (or, plural, corpora) or a dataset, but the idea is basically the same either way. In order to write code to analyze a text, you need a collection of machine-readable files and, often, some additional information that helps you do things like train a machine learning model or compare sub-groups of texts. There are various ways to assemble these kinds of collections, but this assignment is designed to walk you through a very common and easy-to-learn approach. The method I'm describing doesn't have a specific name that I know of, but it's how almost everyone who does this kind of analysis handles their files and folders. This method is also meant to work in tandem with the research design principles described in Ignatow and Mihalcea, "Research Design and Basic Tools," from An Introduction to Text Mining: Research Design, Data Collection, and Analysis.
Building on Ignatow and Mihalcea, some common steps in DH research design might include:
Most likely, you did steps 1 through 3 when you wrote your research proposal. Some of you began the crucial process of thinking about a qualitative or quantitative way of addressing your question, as well as how you might operationalize your method computationally. The focus of this assignment, of course, is to create your corpus, but you may need to do some more thinking about steps 4 and 5 before you know exactly what kind of corpus you need.
The main idea of this step is to connect the texts you collect as directly as possible to your research question. Simultaneously, what you collect and what form it should take must be appropriate for the kind of measure or analysis you want to do. If I want to study the history of 19th-century book reviews (as I am doing in my own work), it doesn't make sense to collect Goodreads reviews, unless I'm planning to write a methods or proof-of-concept paper before I collect reviews from my time period. If I want to study Native American crime fiction, I need texts that belong to that subcategory and, if my question focuses on if/how Native American crime fiction differs from "mainstream" white American crime fiction, I must also collect a corpus of white American crime fiction for comparison.
All this said, some topics have to be approached in pieces or from the side. Coulson, Buchanan, and Aubeeluck, for example, used online support groups for Huntington's disease as a case study in the types of support online communities offer group members. Huntington's was chosen in part because, "Due to the nature of the symptoms, the genetic element of the disease, and the fact that there is no cure, patients with Huntington's disease and those in their support network often experience considerable stress and anxiety" (Ignatow and Mihalcea 147). Coulson, Aubeeluck, and Buchanan have gone on to publish additional pieces on online support groups, Huntington's, and other aspects of patient psychology.
Remember to consider what's appropriate, as well as what's practical. In some cases, you may even opt to change your research question to fit the data you can find or digitize within your time constraints. The most important thing is to develop a research plan with a solid chance of generating some kind of insight that is meaningful to you and your scholarly community.
The following tree of folders and files would all be located in one GitHub repository, which would represent the base folder or starting point:
|-- .gitignore
|-- application
|   |-- some-functions.py
|   |-- other-functions.py
|-- lexicon
|   |-- lexical-data.csv
|-- main.ipynb
|-- meta
|   |-- metadata.csv
|   |-- other-meta-if-needed.csv
|-- README.md
|-- requirements.txt
|-- tables-and-figures
|-- txts
|   |-- file1.txt
|   |-- file2.txt
|   |-- etc.txt
From within this structure, the most important locations for the corpus assignment are:
|-- meta
|   |-- metadata.csv
|-- README.md
|-- txts
|   |-- file1.txt
|   |-- file2.txt
|   |-- etc.txt
If you get these three things right, everything else can come later (and a few of the files and folders listed above are essentially optional, depending on what kind of analysis you're doing). However, it's crucial that you set up the metadata, README, and txts sections at this stage because doing so will save you a great deal of time and hassle later in the process.
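To see how this structure pays off, here is a minimal Python sketch of loading every text alongside its metadata. It assumes, purely for illustration, that metadata.csv has a "filename" column whose values match the names of the files in the txts folder; adjust the column name to whatever you actually use:

```python
import csv
from pathlib import Path

def load_corpus(base_dir="."):
    """Return a dict mapping each filename listed in meta/metadata.csv
    to the full text of the matching file in txts/.

    Assumes (for illustration) a 'filename' column in metadata.csv.
    """
    base = Path(base_dir)
    texts = {}
    with open(base / "meta" / "metadata.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Read the text file named by this metadata row
            txt_path = base / "txts" / row["filename"]
            texts[row["filename"]] = txt_path.read_text(encoding="utf-8")
    return texts
```

Once a function like this works, everything downstream (word counts, model training, comparisons between sub-groups) can be driven by the metadata spreadsheet rather than by hand-maintained file lists.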
In general, this spreadsheet is where you would put author, title, date, genre, and other data about your texts. It's also where you would store training labels for later use. You can include fields for almost any kind of data point, but you should follow these general guidelines for the spreadsheet:
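As an illustration only (these column names and values are invented, not required), a small metadata.csv could be generated with Python's csv module like this:

```python
import csv

# Hypothetical rows: a filename matching a file in txts/, plus whatever
# descriptive fields and training labels your analysis calls for.
rows = [
    {"filename": "file1.txt", "author": "Author One", "year": "1885", "label": "crime"},
    {"filename": "file2.txt", "author": "Author Two", "year": "1891", "label": "mainstream"},
]

with open("metadata.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["filename", "author", "year", "label"])
    writer.writeheader()
    writer.writerows(rows)
```

In practice you may build this file in a spreadsheet program instead of in code; what matters is that it stays a plain CSV with one row per text.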
In this section, you will record and explain to others how you collected your texts and assembled your metadata, including what kinds of choices you made and what any abbreviations in other places might mean. For example:
I've already mentioned the most crucial aspect of this folder, but it bears repeating. Each file in the folder should have a filename that's represented in your metadata.csv file: each file should have its own row, and each row should describe only one file. If you're working with a file format where you only want part of the file (such as the "body" field of an XML file), you should make note of this in your README file.
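Because mismatches between the spreadsheet and the folder are easy to introduce, a quick sanity check is worth running before any analysis. This sketch again assumes a "filename" column in metadata.csv:

```python
import csv
from pathlib import Path

def check_corpus(meta_csv="meta/metadata.csv", txt_dir="txts"):
    """Compare filenames listed in metadata.csv against .txt files on disk.

    Returns two sorted lists: files listed in the metadata but missing
    from txt_dir, and files on disk with no metadata row.
    Assumes (for illustration) a 'filename' column in metadata.csv.
    """
    with open(meta_csv, newline="", encoding="utf-8") as f:
        listed = {row["filename"] for row in csv.DictReader(f)}
    on_disk = {p.name for p in Path(txt_dir).glob("*.txt")}
    return sorted(listed - on_disk), sorted(on_disk - listed)
```

If both lists come back empty, every metadata row has exactly one file and every file has exactly one row.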
Coulson, Neil S., et al. "Social Support in Cyberspace: A Content Analysis of Communication within a Huntington's Disease Online Support Group." Patient Education and Counseling, vol. 68, no. 2, Oct. 2007, pp. 173–178. PubMed, doi:10.1016/j.pec.2007.06.002.
Ignatow, Gabe, and Rada Mihalcea. An Introduction to Text Mining: Research Design, Data Collection, and Analysis. SAGE Publications, 2017.