Over the last few weeks, I have found myself looking at words on two very different scales. On the one hand, I’ve obsessed over single words, words of the five-letter variety to be specific. ¡POETA! On the other hand, I’ve been increasingly interested in words on a much larger scale: words as searchable data in the 2,500+ word range. The dual focus actually started as a single one, as I looked for ways to create an indexable list of WORDLE words searchable by letter position. As the lists got longer, things bifurcated: I became interested in looking at how many words, and how many unique words, are used in a given set. These sets have been my writing, colleagues’ writing and, most recently, the “Rimas” of Gustavo Adolfo Bécquer.

This represents some next-level thinking and tinkering, for a secondary Spanish teacher at least. The quest for searchable patterns has led me to formulas that separate strings of text into cells, allow me to transpose rows into columns, and then identify the frequency of words. This has allowed me to look at writing more analytically and on a much larger scale. It also has me using log scales, which I have not touched since high school. Beyond logarithmic scales, these experiments have me thinking about the words we use and the reasoning behind them.


SECTION 1 – This section is about my how and why, so if you are bored and want to get to Bécquer, you can skip ahead. Don’t be so quick to get to the “golondrinas” that you miss the formulas, however.

As I said, I first pulled together the table on my own writing, starting with a single section (10 students) and then adding another (12 students). This is what it looked like:

To get to that graph, I had to do four things. First, I had to paste the text into a single cell in Google Sheets. Second, I had to distribute each word in the text to its own cell. I did that using the formula =SPLIT(A1, " "). This formula separated the text into words based on the spaces between them. The first hurdle was prepping the number of columns to accommodate the number of words, which meant duplicating hundreds of columns at a time. The second hurdle was accounting for dashes, which I use frequently. Third, I had to transpose the row of words into a column using the Edit > Paste special… > Transposed function. Fourth, I had to use the most complicated function: ={UNIQUE(A2:A1503),ARRAYFORMULA(COUNTIF(A2:A1503,UNIQUE(A2:A1503)))}. This function created a list of all the unique words and provided a count for each. I had to account for some duplicates based on upper and lower case, which I found easier to do with conditional formatting tied to this formula: =COUNTIF(A:A,A1)>1. Once I had a clean count of the words, I inserted a scatter chart, setting the vertical axis to a log scale. This showed me that I was describing “goals,” “ideas,” “team,” “work” and “discussion.” I wasn’t overly adverb-y, and I was speaking from the “I” perspective. I was happy to see that my output seemed to align with my goal. It is one thing to scan the text and sense that; the stats say it clearly.
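For the programmatically inclined, the same four steps can be sketched in Python rather than in Sheets formulas. This is only a rough equivalent of what I did, and the file name narrative.txt is a placeholder for wherever the text actually lives:

from collections import Counter
import matplotlib.pyplot as plt

# Read the text (the file name is a placeholder, not part of my Sheet).
with open("narrative.txt", encoding="utf-8") as f:
    text = f.read()

# Same idea as =SPLIT(A1, " "): break the text apart on whitespace.
words = text.split()

# Same idea as UNIQUE + COUNTIF, with upper and lower case folded together.
counts = Counter(w.lower() for w in words)

# Scatter chart with the vertical axis on a log scale.
freqs = [count for word, count in counts.most_common()]
plt.scatter(range(1, len(freqs) + 1), freqs)
plt.yscale("log")
plt.xlabel("unique words, ranked by frequency")
plt.ylabel("occurrences")
plt.show()

print(len(words), "words,", len(counts), "unique")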

As I said, one section became two, and then I turned the same process to the comments for a single advisee. I was curious to see what patterns would emerge when the student, not the teacher, was the constant. Without going into much detail, I can say that the assessment of performance/progress and of process were in line. This is also a way to check for consistency related to spellings, pronoun usage, tone and other factors. Does this replace proofreading? No. Is it preferable? Perhaps.

We’ll transition to Bécquer in a second, but not before taking on a question that you may have. Why in the wide world would I want to do this? Wouldn’t word count and a thesaurus get me most of the way? Wouldn’t a word cloud represent the same data in a way that was more fun and less formulaic? In a word, “no.” What this process allowed me to do is search the words, sort the words, shuffle them back to their original order and, of course, represent them graphically in different forms. A colleague of mine found this next-gen word counter that can give word count, unique words, longest word, syllable count, keyword density and some other fun facts. It is free and easy, but it lacks graphical representation and the ability to search for a word’s location inside a text.
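To make that last point concrete, here is a small, hypothetical Python sketch of searching for a word’s location inside a text; the single line from Bécquer is just a stand-in for a full work:

# Locate every position of a word, something the online counters don't offer.
def positions(target, words):
    # 1-based positions, ignoring case and surrounding punctuation
    clean = [w.strip(".,;:!?¡¿«»").lower() for w in words]
    return [i + 1 for i, w in enumerate(clean) if w == target.lower()]

sample = "Volverán las oscuras golondrinas en tu balcón sus nidos a colgar"
print(positions("golondrinas", sample.split()))   # -> [4]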


SECTION 2 – This section is go-time for “golondrinas,” and lays out reasons why beginning with Bécquer was maybe not the best idea.

For whatever reason, Bécquer was the first writer that jumped to mind. I guess this is part poetic elegance and part concentrated output. I chose Legends, Tales and Poems, which I found on Project Gutenberg. (I should say that this is the best available archive in terms of holdings and text formatting.) I chose the Poems selection of the work, though this was somewhat foolish for reasons related to formatting. The line spacing and punctuation meant I had to clean up a lot of cells in the Sheet. A cell that reads “muertos!!_»” causes problems and needs to be simplified to “muertos” to make sense. On subsequent testing with a prose piece, “El rubí” by Rubén Darío, I had different formatting issues: multiple words showed up in a single cell in the row, and then in the transposed column, so it took some clean-up.
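Incidentally, that clean-up can be scripted. Here is a minimal Python sketch, assuming the goal is simply to keep the letters and drop the stray punctuation and underscores:

import re

def simplify(token):
    # Drop punctuation, underscores, and typographic marks; keep letters and digits.
    return re.sub(r"[\W_]+", "", token).lower()

print(simplify("muertos!!_»"))   # -> "muertos"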

Back to poetry, quickly. Clearly, this project flattens and uncouples verses in a way that is either illuminating or off-putting. It is also worth mentioning that, prior to taking on this project, I had never reached the maximum cell count on our edu domain.

Big data, indeed. Depending on the number of words in the text, one may have to think creatively about how to split the text into a single row that stretches out beyond the “CCC” column and then transpose it into a single column. I’m curious to learn what my programmer colleagues could do to automate this process. Food for future exploration.
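As a small down payment on that automation, here is a Python helper (my own illustration, not part of the Sheets workflow) that converts a column label to its number and shows just how far out “CCC” sits:

def column_number(label):
    # Convert a spreadsheet column label like "CCC" to its 1-based index.
    n = 0
    for ch in label.upper():
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n

print(column_number("CCC"))   # -> 2109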

The selection of Rimas had 2,749 words. Of those 2,749 words, I determined that 1,078 were unique. Poetry! That would actually be an interesting series of questions (metrics) to begin with: How many unique words? What are they? Are there any patterns? The graph above represents the breadth of Bécquer’s work. As a complement to analysis of meter and meaning, this gives us a way to visualize the work(s). As a pre-reading activity, this could inform some “What are we going to see?” and “What do we want to know?” prompts. These are questions related to vocabulary and discourse. As a post-reading activity, this could inform some reader-response questions like “What were the words and/or images that stuck with you?” and “Were those the statistically significant ones?” and “Is there a relationship between these two questions?” and “Does there need to be?”
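For the number-minded, those first two figures boil down to a type-token ratio. This is just arithmetic on the counts reported above, sketched in Python for the sake of the habit:

# Type-token ratio for the Rimas selection, using the counts reported above.
total_words = 2749
unique_words = 1078
print(f"{unique_words / total_words:.0%} of the words are unique")   # prints "39% ..."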

Now there are next-level questions that we could take on with this process. What shape do different poems have? How do the shapes of works from different poets compare? How do the shapes of prose pieces compare to poetry? Again, this is not meant to replace traditional analysis, nor is it meant to replace greats like Bécquer with bots. I hope it can be a both/and situation. Poetry is pleasure reading. Poetry is language in its purest form. Poetry is rhythm and rhyme, meter and meaning.

You… are poetry. Poesía… eres tú.


While I have shared the charts here, and while most of my reflections connect to the representation of the data, the data itself sings. Need to find the number of words? Or the number of occurrences? Or to sort based on word order? Or alphabetical order? A Sheet set up the way I described above will provide the power to do just that. Now, another project for another day would be some kind of database from which you could pull data from different works and different periods. This might allow a curious soul to see how certain words were used more or less over certain periods. Big Data and poetry are not currently partners. Who is to say they can’t be? In addition to the analysis above, I can also tell you that in Massachusetts, the word golondrina scores an 18 (24th place) on the Google Trends scale from 1/1/04 to 2/9/22. Texas is first with a score of 100. The comparison may seem odd here; however, it is part of an ongoing inquiry into the nature and meaning of the words we use.
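For the sorting piece in particular, a short Python sketch (with a made-up handful of words standing in for the full transposed column) shows the three orderings side by side:

from collections import Counter

# A tiny stand-in word list; in practice this would be the whole transposed column.
words = ["las", "oscuras", "golondrinas", "las", "golondrinas", "volverán"]
counts = Counter(words)

original_order = words                  # word order, as written
alpha_order = sorted(counts)            # alphabetical order of the unique words
freq_order = counts.most_common()       # ranked by number of occurrences

print(original_order)
print(alpha_order)   # -> ['golondrinas', 'las', 'oscuras', 'volverán']
print(freq_order)    # -> [('las', 2), ('golondrinas', 2), ('oscuras', 1), ('volverán', 1)]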

Data is flying around us all, and it will come back. That is certain!

This level of analysis has been available to educators, K-12 and beyond, via AATSP’s Hispania and other industry publications for decades. What sets this framework apart is the ability of anyone with access to a text and Sheets to complete their own analysis. As outlined above, this can be a close study of a literary treasure, a collection of treasures or an educator-produced sample. Additionally, the ability to look at the data in customizable ways is as simple as it is informative.


For more on the intersection of high tech and the humanities, explore these previous posts:

Another Peek into the Potential of Programming in the Humanities

A Peek into the Potential of Programming in the Humanities

Story Seeds and a Surprise

Chatting and Changing the Subject