Data Management and Analysis Core FAQs – Duke University Superfund Research Center

Contents

Data Management FAQs
Data Analysis FAQs
Data Archival FAQs

Data Management FAQs

Why should I create a Data Management and Sharing Plan (DMSP)?

As a data steward, you should always be able to describe the complete operational workflow for your research data, from data capture to data analysis, archiving, and sharing. A DMSP helps you think this through. You are responsible for answering questions about the origin of your data, how the data have been manipulated, where the data are analyzed and archived, and with whom they are shared and under what conditions.

If your research data contain personal information, it is essential to ensure that the privacy of the persons involved is protected during all phases of your research. When you share or link data with a third party, you need to take additional measures such as drawing up a legal agreement that is approved by your institute.

Who can access contact information?

Most studies record contact details for study subjects. Access rules should distinguish between people who have access to the research data and people who have access to these contact data. In principle, no one person should have access to both, unless the researcher is also the treating physician. An exception can be made only for smaller projects in which data are created, processed, and analyzed within a limited period. In your Data Management and Sharing Plan, you will have to justify why this exception applies to your research project.

What does data stewardship cost?

The costs of data stewardship should be included in your grant application or research budget. Explicitly specify the costs for:

Re-using another data collection
Building or using a database
Data management
Long-term preservation of data
Sharing data with others

As a rule of thumb, funders estimate that approximately 5-10% of the available budget is needed for data stewardship.

How can I de-identify the data?

In general, you can protect the privacy of your study subjects by:

Keeping identifiable data separated from unidentifiable research data
Using random unique research codes and separating the code list from the research data
Encrypting vital identifying information
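
The second measure above, random research codes with a separately kept code list, can be sketched roughly as follows. This is a minimal illustration rather than an approved procedure: the function name `pseudonymize`, the `name` field, the `S-` code prefix, and the sample records are all hypothetical, and a real project would also need to store the code list in a separate, access-controlled location.

```python
import secrets

def pseudonymize(records, id_field="name"):
    """Split records into de-identified rows and a separate code list."""
    code_list = {}     # research code -> identifying value; keep this apart from the data
    by_identity = {}   # identifying value -> research code, so one subject keeps one code
    research_rows = []
    for record in records:
        identity = record[id_field]
        if identity not in by_identity:
            code = "S-" + secrets.token_hex(4)
            while code in code_list:  # regenerate in the unlikely event of a collision
                code = "S-" + secrets.token_hex(4)
            by_identity[identity] = code
            code_list[code] = identity
        # Copy every field except the identifying one, then attach the research code.
        row = {k: v for k, v in record.items() if k != id_field}
        row["research_code"] = by_identity[identity]
        research_rows.append(row)
    return research_rows, code_list

# Hypothetical example records.
records = [
    {"name": "Subject One", "exposure_ppm": 0.8},
    {"name": "Subject Two", "exposure_ppm": 1.3},
    {"name": "Subject One", "exposure_ppm": 0.9},
]
research_rows, code_list = pseudonymize(records)
```

Because the codes are random rather than derived from the identifying value, the research rows alone give no way back to a person; only someone holding the code list can re-link them.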

What should I consider when reusing data?

Before reusing data, you should ask yourself questions like:

Will these data help me answer my research question?
Is the quality and integrity of the data sufficient?
Are the data available under appropriate terms and conditions?
What technical measures do I need to take to use the data?
Is it wise to start a scientific collaboration with the data providers?

Which file formats are preferred?

You should use open, well-documented, flexible, frequently used file formats. Consult guides on preferred formats for long-term preservation and accessibility.

What safety requirements apply to my data?

You should implement up-to-date security measures that prevent unauthorized and unnecessary access to your research data, in order to protect both privacy and scientific integrity. You can do this by:

Setting access policies
Protecting data with passwords
Using firewalls, encrypted data transport, backups, etc.
Consulting your information security personnel
Performing risk assessments

Databases connected to the internet require additional security measures. Report any data breaches to appropriate personnel.

How should I organize my files?

Once you start creating and processing data, files can easily become disorganized. Naming and organizing your files consistently from the start saves time and prevents errors. Decide on conventions and share them with all people involved.

If you have many data files, keep a master list with critical information and links. Properly version this list so changes are tracked.
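
One possible way to build such a master list is to generate it from the files themselves. The sketch below assumes a layout and naming convention that are purely illustrative (a `raw` folder, date-prefixed CSV names, a `MANIFEST.csv` output); it records each file's relative path, size, and SHA-256 checksum so that later changes can be detected. Keeping the resulting file under version control is one way to track changes to the list.

```python
import csv
import hashlib
import tempfile
from pathlib import Path

def build_manifest(data_dir):
    """Return one row per file: relative path, size in bytes, SHA-256 checksum."""
    data_dir = Path(data_dir)
    rows = []
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        rows.append({
            "file": path.relative_to(data_dir).as_posix(),
            "bytes": path.stat().st_size,
            "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
        })
    return rows

def write_manifest(rows, manifest_path):
    """Write the manifest rows to a CSV file with a header line."""
    with open(manifest_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["file", "bytes", "sha256"])
        writer.writeheader()
        writer.writerows(rows)

# Demonstration on a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "raw").mkdir()
    (root / "raw" / "2024-01-15_site-A_water.csv").write_text("sample,ppm\n1,0.8\n")
    (root / "raw" / "2024-01-16_site-B_water.csv").write_text("sample,ppm\n1,1.3\n")
    manifest = build_manifest(root)
    write_manifest(manifest, root / "MANIFEST.csv")
    manifest_text = (root / "MANIFEST.csv").read_text()
```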

What data should I store?

At minimum, store your raw data used for publications, including metadata describing how you obtained and processed the data. Ensure metadata clearly describe which data they document. Also store any scripts used with datasets.

What should I do with intermediate files?

Keep intermediate data files when they are needed for reproducibility or traceability. If they are not needed, consider deleting them to save space and reduce privacy risks, or exclude them from backups.

What metadata should I document?

Collect metadata that will help you and others understand, interpret, find, use, reproduce, and properly cite your data in the future. Important metadata may include:

Methodologies, protocols, instrument details, calibration data
Data quality indications
Descriptions of all data elements and files
Standards followed
Software and hardware used
Data provenance
People involved
Funding sources

How do I store metadata and research data?

Metadata and data should be stored close together so their relationship is clear. Some data formats allow embedding metadata.
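
For formats that cannot embed metadata, one common approach is a "sidecar" file stored next to the data file. The following is a minimal sketch under assumed conventions: the `.meta.json` suffix, the `describes` field, and the example metadata values are hypothetical choices, not a required standard.

```python
import json
import tempfile
from pathlib import Path

def write_sidecar(data_path, metadata):
    """Write metadata to '<data file name>.meta.json' in the same folder as the data."""
    data_path = Path(data_path)
    sidecar = data_path.with_name(data_path.name + ".meta.json")
    # Name the data file explicitly so the metadata state which data they document.
    record = {"describes": data_path.name, **metadata}
    sidecar.write_text(json.dumps(record, indent=2, sort_keys=True))
    return sidecar

# Demonstration in a throwaway directory with hypothetical values.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "water_samples.csv"
    data_file.write_text("sample,ppm\nA1,0.8\n")
    sidecar = write_sidecar(data_file, {
        "instrument": "hypothetical analyzer, model X",
        "units": {"ppm": "parts per million"},
        "collected_by": "field team (placeholder)",
    })
    sidecar_name = sidecar.name
    loaded = json.loads(sidecar.read_text())
```

Because the sidecar shares the data file's name and folder, the two travel together when the folder is copied or archived.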

How can I ensure my data are monitored and validated?

Monitor data entry consistently, documenting who enters or modifies data and when. Validate data after entry, for example by having a second person check them or by comparing them with the raw sources. Perform and document data quality checks during and after collection. Keep this process independent of your analyses so that it never influences them.
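
As an illustration of a second-person check, the sketch below compares two independent entries of the same records and lists every disagreement. The function name `compare_entries`, the `sample_id` key, and the example values are hypothetical; the output is a list of discrepancies for a person to resolve against the raw source, not an automatic correction.

```python
def compare_entries(first, second, key="sample_id"):
    """List discrepancies between two independent entries of the same records."""
    first_by_key = {row[key]: row for row in first}
    second_by_key = {row[key]: row for row in second}
    issues = []
    for k in sorted(first_by_key.keys() | second_by_key.keys()):
        a, b = first_by_key.get(k), second_by_key.get(k)
        if a is None or b is None:
            where = "missing in first entry" if a is None else "missing in second entry"
            issues.append((k, "record", where))
            continue
        # Compare field by field so each disagreement is reported separately.
        for field in sorted(a.keys() | b.keys()):
            if a.get(field) != b.get(field):
                issues.append((k, field, f"{a.get(field)!r} != {b.get(field)!r}"))
    return issues

# Hypothetical double entry: the second person typed 13.0 instead of 1.3
# and entered a record the first person did not.
entry_a = [{"sample_id": "A1", "ppm": 0.8}, {"sample_id": "A2", "ppm": 1.3}]
entry_b = [
    {"sample_id": "A1", "ppm": 0.8},
    {"sample_id": "A2", "ppm": 13.0},
    {"sample_id": "A3", "ppm": 0.5},
]
issues = compare_entries(entry_a, entry_b)
```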

How can I cite my data?

To enable citation tracking, provide your publicly available data with a persistent identifier (PID) like a DOI. Indicate in licenses or agreements that you want your data cited on reuse. Construct data citations similarly to article citations, including authors, titles, dates, PIDs, etc.
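
A data citation built from those elements might be assembled as in the sketch below. The element order and punctuation are an assumption made for illustration, since repositories and journals each prescribe their own style, and the author names, repository name, and DOI are placeholders rather than a real dataset.

```python
def format_data_citation(authors, year, title, repository, doi):
    """Assemble a data citation from its elements (illustrative style only)."""
    author_text = "; ".join(authors)
    return (
        f"{author_text} ({year}). {title} [Data set]. "
        f"{repository}. https://doi.org/{doi}"
    )

# Placeholder values; not a real dataset or DOI.
citation = format_data_citation(
    authors=["Author, A.", "Author, B."],
    year=2024,
    title="Example water quality measurements",
    repository="Example Repository",
    doi="10.1234/example",
)
```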

Data Analysis FAQs

How can I prepare my research data for analysis?

To enable transparent, reproducible analyses:

Create metadata documenting your raw dataset
Make a working copy of data, archive originals
Document all data cleaning/processing
Preserve raw and intermediate datasets
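
The "working copy" step above could look roughly like this sketch, which copies a raw file, verifies the copy with a checksum, and marks the original read-only. The folder names `archive` and `work` and the file name are hypothetical, and read-only permissions are only a safeguard against accidental edits, not a substitute for a proper archive.

```python
import hashlib
import shutil
import stat
import tempfile
from pathlib import Path

def sha256_of(path):
    """Return the SHA-256 checksum of a file as a hex string."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def make_working_copy(original, work_dir):
    """Copy the original into work_dir, verify the copy, then protect the original."""
    original = Path(original)
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    working = work_dir / original.name
    shutil.copy2(original, working)
    if sha256_of(working) != sha256_of(original):
        raise IOError("working copy does not match the original")
    # Mark the original read-only; the working copy was made first, so it stays writable.
    original.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return working

# Demonstration in a throwaway directory with a hypothetical file.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "archive" / "raw_measurements.csv"
    original.parent.mkdir()
    original.write_text("sample,ppm\nA1,0.8\n")
    checksum_before = sha256_of(original)
    working = make_working_copy(original, Path(tmp) / "work")
    with open(working, "a") as handle:  # all edits go to the working copy
        handle.write("A2,1.3\n")
    original_unchanged = sha256_of(original) == checksum_before
    working_changed = sha256_of(working) != checksum_before
```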

What analysis tool should I use?

For anonymized data, you can most likely use any reputable statistical software, provided that your processes are well documented and all data manipulations are scripted.

What versions of my data should I preserve?

Store raw data and processed versions representing meaningful, difficult-to-repeat steps. Keep what underlies your analyses and publications. Delete unnecessary intermediate files.

How can I create a data analysis plan?

For complex studies, create an analysis plan beforehand, addressing:

Research questions and data needed
Inclusion/exclusion criteria
Merging datasets
Missing data treatment
Statistical methods

What statistical method should I use?

Think carefully about hypotheses and alternatives before running analyses. Consult experts on appropriate statistical methods. Document all analysis decisions, scripts, and workflows thoroughly.

Data Archival FAQs

What is data archiving?

Scientific data archiving refers to the long-term storage of scientific data and methods in a secure, trusted repository.

How should I archive my data?

Adhere to the FAIR Principles (Findable, Accessible, Interoperable, Reusable) when archiving. Add datasets to field-specific repositories. Register data archived externally. List publicly available data in an open catalogue.

What is the difference between data storage and archiving?

Storage keeps data only temporarily, while archiving ensures long-term retention, preservation, and stewardship according to best practices.

What data should I archive?

At minimum, archive the data underlying your publications and analyses. Also archive all data that are impractical to reproduce, have significant research value, or are subject to legal retention requirements. Do not retain transitory files indefinitely.

How long should I archive my research data?

Minimum: at least 10 years, as stipulated by policies and regulations
Maximum: ideally, anonymized data are preserved indefinitely
Base your decision on how difficult the data are to reproduce, their value, legal issues, and costs
