Data and Databases: Data Management and Storage

Authors: Emily Genatowski; James Baille

Topics: Data management

Introduction

The data you generate in humanities and social science projects may well need longer-term storage beyond the scope of your own research project. Medium to long-term data storage is vital for allowing other scholars to examine and test your data and models, and ensuring open access to your data is an increasingly prominent issue, and at times a requirement, for public funding bodies in many countries.

Data storage is a major issue for digital projects in our fields. This results from the technical challenges of storing data, from the often highly specific nature of the data and their structures, and from significant issues outside the data field itself to do with the allocation of funding and support for archiving. It is therefore important to consider during your project what problems might arise in storing your data for future use by other scholars, and how you might need to plan your work to take account of them. Three questions frame this planning:

How do you plan for the data to be re-used or re-useable?
What is being stored, and in what format?
Who is responsible for the data, and where will it be stored?

We will consider each of these in more detail in the rest of this resource.

Learning outcomes

After completing this resource, learners should be able to:

Articulate the benefits and drawbacks of data re-use
Understand how to store data to promote accessibility
Discuss data responsibilities of humanities projects dealing with databases

Data Re-Use

This area is the first one we must consider, because how you envisage the data being re-used dictates many of the answers to the questions that follow it.

How could someone re-use your data? This is the broadest question, and perhaps the most important, as it includes thinking about both the processes needed to re-use your data and the problems of data re-use. For someone else to use your data, they need to be informed about how the data were collected and what they mean, and they also need the data to be presented in a format that is easy enough for them to use and manipulate.

Are there reasons not to re-use some data? This might result from ethical or legal issues, if for example participants in a survey only gave consent for their responses to be used within the scope of a particular project or the data set includes data from other sources that you don’t have permission to re-share. You should also avoid sharing data in ways that allow living individuals to be identified and de-anonymised: there is more on this problem in the module on data ethics.
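As a minimal sketch of the redaction step, the Python example below drops directly identifying columns from survey records before they are shared. The column names and the sample rows are hypothetical, invented only for illustration. Note that removing direct identifiers is only a first step: combinations of the remaining fields can still allow individuals to be identified, so this does not amount to anonymisation on its own.

```python
import csv
import io

# Hypothetical column names; replace with the identifying fields in your own data.
IDENTIFYING_COLUMNS = {"name", "email"}


def redact_rows(rows, drop_columns=IDENTIFYING_COLUMNS):
    """Return copies of the rows without the listed columns."""
    return [
        {key: value for key, value in row.items() if key not in drop_columns}
        for row in rows
    ]


# Invented sample data standing in for a real survey export.
survey_csv = """respondent_id,name,email,answer
1,Ada Example,ada@example.org,yes
2,Ben Example,ben@example.org,no
"""

rows = list(csv.DictReader(io.StringIO(survey_csv)))
shareable = redact_rows(rows)
```

The original rows are left untouched, so the full data set can stay in restricted storage while only the redacted copy is deposited.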

Are there data standards you should adhere to? There may be particular standards for how data are stored in your field that you need to make yourself aware of, and your data may be more useful if they follow, or are described according to, those standards. For example, in historical research, the large core CIDOC-CRM ontology is commonly used as a way of describing historical entity relationships, and describing your own data model in terms of how it fits with CIDOC-CRM may help other researchers use it in future.

Should users be able to browse the data? Data can be stored in quite simple data store formats, but the downside of these is that they are only accessible to researchers who are able to download the data and import it into their own research tools for re-analysis. Some data may be more helpful to scholars for re-use if it is presented in formats where they can browse individual records or use basic online tools to search for subsets of the data. This may be the case where a data set is sufficiently large that being able to define a subset before downloading is useful, or where many of the interested users of the data set may not have the technical competencies or time to do their own analyses and may find browsing individual data points helpful.

What metadata are needed to explain the data? As we have covered elsewhere in this course, metadata are very important for ensuring that people understand what the data they are looking at mean and how they can properly re-use it. Your storage solution should ensure that metadata remain properly attached to your data and properly visible to people re-using your data, which can be especially important to consider if you are reformatting the data for storage purposes or providing a way to present the data via an interface.
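One simple way to keep metadata attached to a data file is to write a "sidecar" file with a matching name next to it. The Python sketch below is an illustration under stated assumptions rather than a standard: the function name, the file naming pattern and the metadata fields are all hypothetical, and your field or repository may prescribe a specific metadata schema instead.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_with_metadata(rows, fieldnames, metadata, data_path):
    """Write rows to a CSV file and the metadata to a JSON file beside it."""
    data_path = Path(data_path)
    with data_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

    # letters.csv -> letters.metadata.json, so the two files sort together.
    metadata_path = data_path.with_name(data_path.stem + ".metadata.json")
    with metadata_path.open("w", encoding="utf-8") as handle:
        json.dump(metadata, handle, indent=2, ensure_ascii=False)
    return metadata_path


# Hypothetical example records and field descriptions.
rows = [{"letter_id": "L001", "year": "1847"}]
metadata = {
    "title": "Example letters dataset",
    "fields": {
        "letter_id": "Project-internal identifier for each letter",
        "year": "Year the letter was written, as given in the source",
    },
}

with tempfile.TemporaryDirectory() as folder:
    sidecar = write_with_metadata(
        rows, ["letter_id", "year"], metadata, Path(folder) / "letters.csv"
    )
    sidecar_name = sidecar.name
    reloaded = json.loads(sidecar.read_text(encoding="utf-8"))
```

Because both files are written by the same function, any script that reformats the data for storage also regenerates the description of each field.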

What documentation is needed to explain the data? Often, humanities data can be of limited use to other scholars without a good idea of the research methods used to produce it. As with metadata, it is important to ensure that the documentation remains with the data, which can be difficult: anyone who accesses the data should also have access to the documentation, and your data management plan should explain how this will happen.
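One way to keep documentation with the data is to deposit both in a single archive file. The Python sketch below, using the standard library's zipfile module, refuses to build the archive if the documentation file is missing. The function name and file names are hypothetical, and a repository may well have its own packaging requirements that take precedence.

```python
import tempfile
import zipfile
from pathlib import Path


def bundle_for_deposit(archive_path, data_files, documentation_file):
    """Zip the data files together with their documentation.

    Returns the sorted list of file names stored in the archive.
    """
    documentation_file = Path(documentation_file)
    if not documentation_file.is_file():
        raise FileNotFoundError(f"Documentation missing: {documentation_file}")

    members = [Path(path) for path in data_files] + [documentation_file]
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for member in members:
            # Store each file under its bare name, without local folder paths.
            archive.write(member, arcname=member.name)

    with zipfile.ZipFile(archive_path) as archive:
        return sorted(archive.namelist())


with tempfile.TemporaryDirectory() as folder:
    folder = Path(folder)
    # Invented example files.
    (folder / "letters.csv").write_text("letter_id,year\nL001,1847\n", encoding="utf-8")
    (folder / "README.txt").write_text("How the letters were transcribed.\n", encoding="utf-8")
    contents = bundle_for_deposit(
        folder / "deposit.zip", [folder / "letters.csv"], folder / "README.txt"
    )
```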

What to Store, How to Store it

In considering data re-use, we have already considered some of what needs to be stored: we’ve thought about the metadata and documentation we need, whether we need to standardise or describe our data according to data standards in our field, and where there might be parts of the data set that need to be removed or redacted from what we keep in storage. Now, though, we need to consider what the storage and future access system for the core data looks like on a technical level. Once again, we’ll go through some questions to consider when doing your planning.

Could subsets of your data have wider uses? There are major existing repositories for some sorts of data, especially geographical and place-name information, and contributing compatible parts of your data to such resources can be worthwhile and help show the wider utility of your project work. This will be unlikely to include all your data or otherwise reduce the need for you to consider other data storage problems, but data stored within wider projects is often particularly secure and this is an area worth considering when planning your data management.

What file format(s) will we use? Data provided in ‘flat’ file formats, such as RDF serialised to a text file, can be easier for people to work with later, and can be easier to store in the medium term than putting something online as a queryable database. If you want to have copies of your data stored in other storage spaces, flat files may save space and be more robust to changing technologies in future. Conversely, a queryable database online with an interface can maximise other researchers’ access to your data. You may indeed wish to provide both – they have different use-cases.
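If you decide to provide both forms, the flat copy can be generated from the database rather than maintained by hand. The following Python sketch assumes, purely for illustration, that the queryable database is SQLite and that CSV is the chosen flat format; the table and its contents are invented. The same idea applies to other database systems and to other flat formats.

```python
import csv
import io
import sqlite3


def table_to_csv_text(connection, table):
    """Return the full contents of one table as CSV text, header row first."""
    # Table names cannot be passed as SQL parameters, so check the name
    # against the database's own list of tables before using it in a query.
    tables = {
        row[0]
        for row in connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        )
    }
    if table not in tables:
        raise ValueError(f"No such table: {table}")

    cursor = connection.execute(f'SELECT * FROM "{table}"')
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)
    return buffer.getvalue()


# Invented example database held in memory.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE places (id INTEGER PRIMARY KEY, name TEXT)")
connection.executemany(
    "INSERT INTO places (id, name) VALUES (?, ?)", [(1, "Vienna"), (2, "Graz")]
)
flat_copy = table_to_csv_text(connection, "places")
connection.close()
```

Running an export like this each time the database changes keeps the downloadable flat files in step with what the online interface shows.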

Do you need an interface? You can provide data in a number of ways depending on your file formats, including just as the option of file downloads. However, you may want a web interface attached to your database so that people can browse and visualise your records without having to download the entire dataset themselves. If you do need a web interface, you will need to consider how this will be created and who will do so.

How much file space do your data take up? Databases that just contain text records can often be quite small, but database systems that contain large numbers of complete texts for frequency or corpus analysis may be significantly larger. If your database is a set of high-detail image scans, meanwhile, you may need many gigabytes of file storage to store all of your data.
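Before approaching a storage provider, it helps to have a measured figure rather than a guess. The short Python sketch below adds up the size of every file under a project folder. The function names and the example folder layout are hypothetical, and the sizes reported are of the files as they currently sit on disk, not of any compressed or converted deposit.

```python
import tempfile
from pathlib import Path


def total_size_bytes(folder):
    """Sum the sizes of all regular files under the folder, including subfolders."""
    return sum(path.stat().st_size for path in Path(folder).rglob("*") if path.is_file())


def human_readable(size_bytes):
    """Format a byte count using binary units (1 KiB = 1024 bytes)."""
    units = ("B", "KiB", "MiB", "GiB", "TiB")
    size = float(size_bytes)
    for unit in units:
        if size < 1024 or unit == units[-1]:
            return f"{size:.1f} {unit}"
        size /= 1024


with tempfile.TemporaryDirectory() as folder:
    folder = Path(folder)
    # Invented stand-ins for a small text table and a larger image scan.
    (folder / "scans").mkdir()
    (folder / "records.csv").write_bytes(b"x" * 1000)
    (folder / "scans" / "page1.tif").write_bytes(b"x" * 536)
    measured = total_size_bytes(folder)
```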

Data Responsibilities

The final set of problems in medium-term data management and storage arises from the practical issues of implementing your plan. You may know how you want your data to be stored, and you may have considered the necessary implications of re-use, but you still need to put that into action – and that means considering the people, institutions and infrastructure needed.

Who is responsible for the data? Whilst you have a lot of the initial responsibilities for your data collection and management, part of the principle of re-use is that the data should be stored in a more permanent location and someone, ideally an institution rather than an individual, therefore needs to take longer-term responsibility for holding the data. Identifying who can take on that responsibility is important for long-term storage. To assess the options, you should look at university computing services and libraries and at national facilities for data storage, and discuss the question with permanent faculty members at your university.

Where will the data be stored? Data need computer space to be stored, and if accessible online they will need server space and a place to put their interface. At least one responsible party needs to agree, and have sufficient resources, to provide a data storage location.

What are people responsible for? You need to clearly lay out the responsibilities different people or institutions will have for particular aspects of the data storage. Providing the server space, ensuring the content and interfaces are properly created to allow re-use, ensuring any redactions and changes to the data set are properly made, and handling any future issues with the data (if you are providing for that possibility) may all be different roles.

What if things change? A robust medium-term storage solution should consider the possibility of technical capacity and institutional or funding situations changing. For example, a university body that holds your data might shut down in three years’ time, or your interface might rely on a current generation of web browsers that is no longer available in a few years. Storing data in multiple places can help counteract the first of these problems: having “mirror” locations for the data, if possible, will help ensure their survival. Keeping an interface maintained, meanwhile, is more difficult.
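When data are held in several mirror locations, it is worth being able to confirm that each copy is still identical to the original. A common approach is to compare checksums; the Python sketch below uses SHA-256 from the standard library. The function names and example files are hypothetical, and the sketch assumes you can read both copies as local files.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 checksum of a file as a hexadecimal string."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read in chunks so that large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(original, mirror):
    """True if the two files have identical contents according to SHA-256."""
    return sha256_of(original) == sha256_of(mirror)


with tempfile.TemporaryDirectory() as folder:
    folder = Path(folder)
    # Invented example: one intact mirror and one damaged copy.
    (folder / "original.csv").write_bytes(b"letter_id,year\nL001,1847\n")
    (folder / "mirror.csv").write_bytes(b"letter_id,year\nL001,1847\n")
    (folder / "damaged.csv").write_bytes(b"letter_id,year\nL001,1848\n")
    intact = copies_match(folder / "original.csv", folder / "mirror.csv")
    corrupted = copies_match(folder / "original.csv", folder / "damaged.csv")
```

Recording the checksum alongside the deposited files also lets future users check their own downloads without access to the original.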

How will this be funded? Retaining server space and sufficient maintenance to keep your data available may well not be cost-free. Check whether your university, or perhaps other academy or library facilities, have a programme available for data storage. If not, consider whether you may need grant funding to secure the server space that you need.

Conclusion

In this section we have covered most of the questions you need to answer when producing a data management plan (DMP). By working through these, you should have most of the core bases covered. Do get someone else to look over any data management plan you make: this is especially important when (as in most cases) the DMP requires other stakeholders, such as your supervisor or your university computing services, to have some implementation responsibilities.

Data management and long-term storage are a genuine problem for researchers in digital humanities and social science work: it may be that it is simply not possible in all cases to create an ideal data management plan that ensures effective re-usability or storage of your data. The space and storage needed to do so may impose costs that you and your institution do not have the research resources to cover. However, these challenges make it more important to work through planning in this area, in order to assess what you can achieve and how best to preserve your carefully constructed datasets for future use.
