This is the final day. Our workshop participants have gone their separate ways or down to the pub for a final drink. It has been an exhilarating experience, and very challenging, but probably one of the most welcoming events that I have ever attended. One of the reasons for this seems to be that most people here are in the same state of openness in learning. Despite, or perhaps because of, the wide range of methods, many years of experience, and knowledge domains way beyond those that most humanists find comfortable, our willingness to accept new ways of thinking about and constructing (not just managing) data is very positive and productive for forging new partnerships and developing a greater sense of duty to share what we do with others.

This morning we examined copyright, something that we have taught before on our Arts and Humanities in the Digital Age programme. I really liked the case studies, which often involved web scraping and aggregation of data. Even where these practices are permitted, the greatest issue seems to come from how the data are then used. When I work with research students on queries about their ethical approval applications, the use to which data are put, and the potential for released data to be reused in certain ways, can make what seems like an unproblematic project ethically questionable. Just because we can do a lot with data doesn’t always mean we should.

The second part of the morning was devoted to our own interests and those of our fellow participants. I joined a group which was looking at web scraping and collecting and cleaning OCR text. Three of us were, as it turns out, working on similar types of documents, but at different stages in a workflow, which would involve:

  1. Extracting raw text from a PDF document
  2. Isolating the relevant parts of that document
  3. Identifying and marking the entities that would then be used for further analysis

Jointly, we have agreed to stay in touch and share a workflow, which would involve combining at least two stages of the data preparation using pdfminer and then methods of cleaning involving regular expressions. The idea will be to make this workflow available on GitHub and to share with other researchers and students.
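As a rough illustration of the three stages listed above, here is a minimal sketch in Python. It is not the workflow we agreed on, which is still to be written. It assumes the maintained `pdfminer.six` fork and its `extract_text` helper, and the section markers, the list of names and the `<name>` tag are hypothetical placeholders chosen only for the example.

```python
import re


def extract_raw_text(pdf_path):
    """Stage 1: pull raw text out of a PDF (assumes pdfminer.six is installed)."""
    # Imported here so the cleaning functions below work without pdfminer.
    from pdfminer.high_level import extract_text
    return extract_text(pdf_path)


def clean_text(raw):
    """Tidy common OCR/extraction debris with regular expressions."""
    # Rejoin words hyphenated across a line break: "philo-\nsopher" -> "philosopher"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Turn single line breaks into spaces, but keep blank lines between paragraphs
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


def isolate_section(text, start_marker, end_marker):
    """Stage 2: keep only the text between two (hypothetical) markers."""
    pattern = re.compile(
        re.escape(start_marker) + r"(.*?)" + re.escape(end_marker), re.DOTALL
    )
    match = pattern.search(text)
    return match.group(1).strip() if match else ""


def mark_entities(text, names):
    """Stage 3: wrap known names in a placeholder <name> tag."""
    if not names:
        return text
    # Longest names first, so "Denis Diderot" is preferred over "Diderot"
    ordered = sorted(names, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(n) for n in ordered) + r")\b"
    return re.sub(pattern, r"<name>\g<0></name>", text)


sample = "The philo-\nsopher Denis Diderot\nwrote   much.\n\nBEGIN Voltaire replied. END"
cleaned = clean_text(sample)
section = isolate_section(cleaned, "BEGIN", "END")
marked = mark_entities(section, ["Voltaire"])
print(marked)
```

With a real document, the `sample` string would be replaced by the output of `extract_raw_text`, and the markers and name list would come from the source material itself.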

From these discussions, in the final session this afternoon, we also committed specific learning aims to a personal learning plan. This was tremendously helpful in thinking over the short, medium and long term. It will certainly be something I aim to take forward in discussions around my current academic role, but also in my responsibilities towards our Consortium and the Arts and Humanities in the Digital Age programme.

Our final event of the Summer School was a closing lecture by Dr Glenn Roe, whose work on eighteenth-century texts drew together many different elements of the entire event and posed very interesting questions around the ontologies that we employ, or try to employ, today. Through a combination of different analytical methods, he demonstrated not only how past attempts to classify knowledge worked both explicitly and implicitly (using topic modelling and discourse analysis), but also how eighteenth-century writers incorporated texts and elements of texts into their own. Matching n-grams and skip-grams allows us to track the verbatim reuse or paraphrasing of text through commonplace books and other publications. Essentially this is the same kind of technology that powers plagiarism detection software, used not on assessment, but on past knowledge.
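To make the idea of n-grams and skip-grams a little more concrete, here is a small sketch in Python. It is my own toy example and not the method presented in the lecture: the two sentences are invented, the function names are hypothetical, and the skip-gram version is limited to pairs of words. It shows why skip-grams help with paraphrase: changing one word breaks every matching three-word sequence, but some word pairs survive if we allow a gap between them.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def ngrams(tokens, n):
    """All contiguous sequences of n tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def skip_bigrams(tokens, max_skip):
    """Ordered word pairs with at most max_skip tokens skipped between them."""
    pairs = set()
    for i, left in enumerate(tokens):
        for j in range(i + 1, min(i + max_skip + 2, len(tokens))):
            pairs.add((left, tokens[j]))
    return pairs


def jaccard(a, b):
    """Share of items the two sets have in common (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


source = tokenize("The quick brown fox jumps")
reuse = tokenize("The quick red fox jumps")

# Verbatim matching: one changed word means no trigram is shared.
shared_trigrams = ngrams(source, 3) & ngrams(reuse, 3)

# Allowing one skipped word recovers pairs on either side of the change.
shared_skips = skip_bigrams(source, 1) & skip_bigrams(reuse, 1)

print(shared_trigrams)
print(sorted(shared_skips))
print(jaccard(skip_bigrams(source, 1), skip_bigrams(reuse, 1)))
```

On a real corpus the comparison would run across many passages, and the overlap score would only flag candidates for a human reader to check.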

In summary, DHOxSS 2018 has been a revelation. Although many scholars may not find the issues I have raised in the last five posts pertinent to their line of work, what is undeniable is the growing amount of digital content that researchers will need to manage. The accessibility of data is not as great a barrier as it once was. The real challenge for the humanities lies in preparing individuals to navigate the affordances and constraints of the interface, and to decide confidently how data can be extracted, structured and processed for storage and reuse, before we even reach the stage of conducting analysis. A very good point made in the final session was that curating and creating data is actually the kind of ontological, epistemological and ethical challenge that humanists should want to take on.

My thanks to the University of East Anglia for allowing me to attend the event and to Eastern ARC for the generous funding of my place.