Today’s challenge was to tackle two tools for the cleaning of ‘dirty’ data: Open Refine and SQLite Studio. My initial impression of Open Refine is its usefulness for speeding up tasks that academics and research students could easily spend days or weeks on: the kinds of cleaning tasks that, done by hand, inevitably introduce new human errors and inconsistencies in the attempt to fix the (usually) human errors and inconsistencies already in the data.

There is also the issue that any data-cleaning intervention still relies on human supervision and checking - in particular, being able to review the recommendations produced by algorithms. Just as important is understanding how these algorithms work, so that attempts to clean large datasets are productive - for instance, choosing a level of tolerance for clustering that is appropriate to the dataset.
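To make the clustering point concrete, here is a minimal sketch of the kind of key-collision (‘fingerprint’) clustering such tools rely on - a simplified illustration in Python, not Open Refine’s actual implementation, and the sample values are invented:

```python
from collections import defaultdict
import string

def fingerprint(value: str) -> str:
    """Normalise a value: lowercase, strip punctuation, sort unique tokens."""
    cleaned = value.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)

# Invented, messy variants of the same institution name.
values = [
    "University of Oxford",
    "Oxford, University of",
    "university of oxford ",
    "Univ. of Oxford",
]

# Group values whose fingerprints collide; each group is a candidate merge
# that a human still needs to review before accepting.
clusters = defaultdict(list)
for value in values:
    clusters[fingerprint(value)].append(value)

for key, members in clusters.items():
    if len(members) > 1:
        print(f"Possible cluster ({key}): {members}")
```

How aggressively the fingerprint normalises (here it ignores case, punctuation and word order, but not abbreviations such as ‘Univ.’) is exactly the kind of tolerance decision that has to be matched to the dataset.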

SQLite brought the added challenge of figuring out what expressions we needed to provide for the program to return results for our queries from a particular table. Learning to use SQLite is also about learning to write effective queries, which makes a lot of sense if we scale up (in the manner of the knowledge models covered yesterday) to the level of research questions. Although many social scientists working with quantitative methods and experimental research designs learn to shift between these levels in their work, those in the humanities who are unfamiliar with data science will find this more difficult to do.
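As a minimal sketch of the kind of query we were practising, here is the same idea using Python’s built-in sqlite3 module; the database, table and column names are hypothetical:

```python
import sqlite3

# Open (or create) a local SQLite database file.
conn = sqlite3.connect("survey.db")
cur = conn.cursor()

# A hypothetical table of cleaned survey responses.
cur.execute("""
    CREATE TABLE IF NOT EXISTS respondents (
        id INTEGER PRIMARY KEY,
        surname TEXT,
        institution TEXT,
        year INTEGER
    )
""")

# The expression SQLite needs in order to return rows from a particular table:
# SELECT some columns FROM a table WHERE a condition holds.
cur.execute(
    "SELECT surname, institution FROM respondents WHERE year >= ? ORDER BY surname",
    (2010,),
)
for row in cur.fetchall():
    print(row)

conn.close()
```

Scaling up, the WHERE clause is where a narrow technical expression starts to resemble a research question about the data.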

In the case of both Open Refine and SQLite Studio, one of the key features was the ability to log the different transformations and queries. These logs provide crucial records of each step of data cleaning, and it would be helpful to develop exercises that help research students adapt these logs into evidential statements of method where required.

This afternoon’s session focused on the idea of preservation, which academics and their research students must increasingly come to terms with. So many digital humanities projects - some on a grand scale - have vanished despite great investments of money and effort. The data, the analytical methods, and the findings of these projects no longer exist, or are no longer accessible to others.

Katrina Fenlon took us through three different models of digital humanities projects to better understand their goals and where the emphasis in digital preservation would need to lie. Perhaps because such projects are technically complex to implement, it is also difficult for those involved to decide what should be preserved and how, e.g. a unique, value-added interface built over digital images, where the images themselves can easily be obtained from other institutions.

This evening’s lecture was presented by Dr Athanasios Velios at Oxford on integration and linked data for museum and library collections. We have heard a lot about linked data all week: the model of ‘triples’ (subject-predicate-object) was introduced at the very start. This lecture was helpful in visualising the ontological frameworks that institutions and researchers are putting together to allow at least some agreement on the nature of our world of things. For those who have worked in libraries and museums professionally, these schemes may be familiar, but for many researchers whose expertise lies in the objects themselves and their interpretation, description can seem over-laborious. However, as the speaker demonstrated, it is only by effective descriptive frameworks and practices that the vast sum of knowledge around cultural heritage can be linked and exploited.
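As a minimal sketch of that triple model, here is how a couple of subject-predicate-object statements about a museum object might be written with Python’s rdflib library; the namespace, identifiers and property are hypothetical rather than drawn from the lecture:

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDFS

# A hypothetical namespace for a museum's own collection records.
COLL = Namespace("http://example.org/collection/")

g = Graph()

# One triple: subject (a bound manuscript), predicate (a label), object (a literal).
manuscript = URIRef(COLL["ms-042"])
g.add((manuscript, RDFS.label, Literal("Bound manuscript, 16th century")))

# A second triple linking the manuscript to the institution that holds it.
g.add((manuscript, COLL.heldBy, URIRef(COLL["example-library"])))

# Serialising as Turtle makes the subject-predicate-object structure easy to read.
print(g.serialize(format="turtle"))
```

In practice the predicates would come from a shared ontology agreed across institutions, which is precisely what makes the linking and exploitation possible.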