The abundance of stop words and OCR noise in The College News means that the topic modeling tool will not always produce meaningful results when analyzing our corpus. We have therefore chosen to visualize only the topics that contain meaningful content. Two topics are excluded from the line chart because they lack specific information: “2. college mawr bryn news ee pa ardmore 5 ave editor ae page 3 year 2 1 company lancaster wednesday weeks” and “9. page 30 8 1 continued 4 2 room 5 col 3 00 7 music april march 15 6 common friday”.
Our timeline starts in 1950 and ends in 1968, when The College News merged with Haverford’s student newspaper. We chose this period for our visualization of political topics because of its historical backdrop of the Cold War, the Vietnam War, and the civil rights movement. We are interested in investigating political awareness and engagement in the Bryn Mawr community during this period of political and social change.
The relative frequency of a topic in a chunk of text is the number of words assigned to that topic divided by the total word count of the chunk. The Topic Modeling Tool computes the relative frequency of each topic per chunk as part of its output. We then calculated the relative frequency of each topic per year by taking the mean of its relative frequencies across all chunks published in that year.
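The yearly averaging step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the sample values, and the function name `yearly_topic_means` are hypothetical and do not reproduce the Topic Modeling Tool's actual output format or our real data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-chunk records: (publication year, {topic id: relative frequency}).
# The numbers are made up for illustration only.
chunks = [
    (1950, {0: 0.20, 6: 0.10}),
    (1950, {0: 0.40, 6: 0.30}),
    (1951, {0: 0.10, 6: 0.50}),
]


def yearly_topic_means(records):
    """Average each topic's relative frequency over all chunks from the same year."""
    by_year_topic = defaultdict(list)
    for year, freqs in records:
        for topic, freq in freqs.items():
            by_year_topic[(year, topic)].append(freq)
    # One mean per (year, topic) pair.
    return {key: mean(values) for key, values in by_year_topic.items()}


yearly = yearly_topic_means(chunks)
for (year, topic), value in sorted(yearly.items()):
    print(year, topic, round(value, 3))
```

Because every chunk counts equally toward the mean, a year with many short chunks and a year with few chunks are treated the same way, one chunk at a time.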
Topic modeling allows us to analyze our large corpus by extracting topics. A “topic” is a cluster of words that frequently occur together. This form of computational text analysis relies on an algorithm called “Latent Dirichlet Allocation”, which includes a random (aleatory) element. As a result, even with the same corpus as input, the output of topic modeling will differ slightly from run to run.
We uploaded our entire corpus into the topic modeling tool and adjusted the settings to divide the text into chunks of 500 characters. This allowed the tool to identify words that appear near each other, rather than words that merely co-occur somewhere in the same issue of The College News. The tool generated 10 topics for the whole corpus. We identified one of these topics as political and used its cluster of words to run a keyword search through all text chunks. All chunks containing one or more of these keywords were sorted into a separate directory. This directory then served as new input for the topic modeling tool, in order to obtain more politics-focused results. The final output contains three political topics out of ten in total.
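The chunking and keyword-filtering steps can be approximated with the following minimal sketch. It rests on several assumptions: the chunking here is a plain fixed-width split rather than the tool's own implementation, the keyword match is a case-insensitive whole-word match (the matching rule is not specified above), the keyword set is a small sample from a political topic, and the names `split_into_chunks` and `contains_keyword` are hypothetical. Writing the matching chunks to a separate directory is left out.

```python
import re

CHUNK_SIZE = 500  # characters per chunk, matching the setting described above


def split_into_chunks(text, size=CHUNK_SIZE):
    """Cut the text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def contains_keyword(chunk, keywords):
    """Return True if the chunk contains at least one keyword as a whole word."""
    words = set(re.findall(r"[a-z]+", chunk.lower()))
    return not words.isdisjoint(keywords)


# Hypothetical keyword sample drawn from a political topic's word cluster.
keywords = {"political", "policy", "government"}

# Made-up stand-in for the OCR text of one issue.
issue_text = "The committee met on Friday. " * 20 + "Foreign policy was debated."

chunks = split_into_chunks(issue_text)
political_chunks = [c for c in chunks if contains_keyword(c, keywords)]
print(len(chunks), len(political_chunks))
```

A fixed-width split can cut a word in half at a chunk boundary, so a keyword that straddles two chunks would be missed by this sketch.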
We graphed the data using Altair, a Python library for data visualization based on Vega and Vega-Lite. We chose a line chart because we wanted to see whether each topic shows a pattern or trend over time.
Topic | Most common words |
---|---|
0. | political policy president government conference alliance party gov mr foreign united news states discussion general affairs national issue state issues |
1. | american africa country europe south world west east york travel french home days time city african france _ trip air |
3. | world mr war people united states man peace great country con problem life time today nations ing china problems ed |
4. | school scholarship york scholar high prepared mary pennsylvania ann elizabeth jane alumnae national regional college anne nancy rhoads barbara city |
5. | students bryn mawr student college year campus work committee program haverford colleges group class service school faculty ing summer stu |
6. | civil state government rights negro people ment law public social american south action labor tion white aid education support union |
7. | play world good man show time ing life love story audience don men played cigarette part stage great back young |
8. | mr university american dr professor history science miss department art english work political research study philosophy de college mrs mawr |