Insight Data Management

File Organization

The directory structure of your submittal can significantly contribute to understanding the data package and efficient consumption of the data. File organization should move from general information in the top directories to increasingly specific information in the sub-directories. This section outlines a number of recommendations to support the development of a well-formed data package.

Organize Files

Organize files into a logical directory structure, and not as a loose set of files within a single folder.

Studies typically follow a progression of data evaluation from raw to summarized findings.
Group files based on project, assay (or data type), and ancillary data or metadata (see Figure 1).

Primary Level Directory Folder

The primary level directory folder should contain:

file_location	size_bytes
/project/assay/raw/missionX_subjectA_assay.tar	3139235840
/project/assay/raw/missionX_subjectB_assay.tar	2713293944

md5sum	file_location
957c168884ccc1dbfb0e1028ffd1e53e	/project/assay/raw/missionX_subjectA_assay.tar
af1b39494c2c66a21add7e80a3c5d7d3	/project/assay/raw/missionX_subjectB_assay.tar
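
A listing like the file_location / size_bytes table above can be generated rather than typed by hand. The following is a minimal Python sketch, not an LSDA tool: the function names build_manifest and write_manifest, the tab-delimited layout, and the use of paths relative to the package root are assumptions made for illustration.

```python
from pathlib import Path


def build_manifest(root):
    """Return sorted (file_location, size_bytes) pairs for every file under root.

    file_location is written relative to root with forward slashes
    (an assumption; the example above uses absolute-style paths).
    """
    root = Path(root)
    rows = []
    for path in root.rglob("*"):
        if path.is_file():
            rows.append((path.relative_to(root).as_posix(), path.stat().st_size))
    rows.sort()
    return rows


def write_manifest(root, out_path):
    """Write the manifest as a tab-delimited text file with a header row."""
    # Collect the rows first so that a manifest stored inside root
    # is not created (and listed) while the directory is being walked.
    rows = build_manifest(root)
    with open(out_path, "w", encoding="utf-8", newline="") as fh:
        fh.write("file_location\tsize_bytes\n")
        for location, size in rows:
            fh.write(f"{location}\t{size}\n")
```

Calling write_manifest("project", "manifest.tsv") would list every file below the hypothetical project directory; the output file name is likewise only an example.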

Secondary/Tertiary Level Folder

Secondary/tertiary level directory folders should contain:

Files representing different levels of abstraction grouped within their own folder.
1. Folders with data should only contain files of the same type. Do not intermix different file types within the same folder (e.g. all files related to an RNASeq assay should be contained within a single directory structure).
Nested folders of related data should contain README files to more completely describe the structure of each file type.
Data integrity checks (MD5 Checksum) files may be included at each directory level.

Additional Files

Additional files – such as summary findings, reports and statistics – may be included within the primary or secondary level directory folders, but not intermixed with data files. A sample directory structure is provided in Figure 1:

File Naming Convention

Descriptive file names are an important aspect of organizing, sharing, and managing data files. Develop a naming convention based on elements important to the project. Additionally, do not reuse simple file names multiple times in different directories. Ideally, use file names that are unique to the data set. This section builds on the guidelines of the Princeton University Library best practices for file naming. The most important aspects to remember about file naming are to be consistent and descriptive in naming and organizing your files so that it’s obvious where to find a file and what it contains.

File Naming Best Practices:

Files should be named consistently.

File names should be descriptive and not too long (fewer than 64 characters).

Do not use special characters or spaces in file names. Use capital letters and underscores (‘_’) to separate words instead.

Use ISO 8601 date format: YYYYMMDD.

Include a version number, where appropriate.

Write down the naming convention in the README file.

Elements to consider in the naming convention:

Date of creation

Short description

Location or institution

Project or mission name

Sample name or type

Analysis method

File Naming Examples: (Just examples, specific convention up to the investigator)

Excel spreadsheet example:
1. <grant#>_<experiment-id>_<desc>.xlsx
MRI images taken in different positions:
1. <university>_<study>_<type>_<position>_<yyyymmdd>.dcm
Human genome sequence:
1. <institution>_<platform>_<mission>_<subjectID>_<sequenceNumber>_<yyyymmdd>.fastq
2. <institution>_<platform>_<mission>_<subjectID>_<yyyymmdd>.bam
3. <institution>_<platform>_<mission>_<subjectID>_<VCFtype>_<yyyymmdd>.vcf
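
The naming practices above can be checked automatically before a package is assembled. The sketch below is one possible check, not an LSDA requirement: the function name check_file_name is hypothetical, the allowed character set (letters, digits, underscores, hyphens, and a period before the extension) is an assumption, and any stand-alone run of eight digits is assumed to be a YYYYMMDD date.

```python
import re
from datetime import datetime

MAX_LENGTH = 64  # the guideline asks for names shorter than 64 characters

# Assumed character set: letters, digits, underscore, hyphen, plus extensions.
ALLOWED = re.compile(r"^[A-Za-z0-9_\-]+(\.[A-Za-z0-9]+)+$")

# Assumption: a stand-alone run of exactly eight digits is meant as a date.
DATE_TOKEN = re.compile(r"(?<!\d)(\d{8})(?!\d)")


def check_file_name(name):
    """Return a list of problems found in name; an empty list means it passes."""
    problems = []
    if len(name) >= MAX_LENGTH:
        problems.append("too long")
    if not ALLOWED.match(name):
        problems.append("contains spaces or special characters")
    for token in DATE_TOKEN.findall(name):
        try:
            datetime.strptime(token, "%Y%m%d")
        except ValueError:
            problems.append(f"{token} is not a valid YYYYMMDD date")
    return problems
```

For example, check_file_name("univ_studyA_T1_supine_20240131.dcm") returns an empty list, while a name containing spaces or parentheses is reported. The specific rules should be adapted to the convention recorded in the README file.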

Metadata Guidelines

Best Practices

Create one README file for each data type whenever possible. It is also appropriate to describe a "dataset" that has multiple, related, identically formatted files, or files that are logically grouped together for use (e.g. a collection of Matlab scripts). When appropriate, also describe the file structure that holds the related data files.
Name the README file so that it is easily associated with the data file(s) it describes.
Write your README document as a plain text file, avoiding proprietary formats (such as MS Word) whenever possible. Format the README document so it is easy to understand (e.g. separate important pieces of information with blank lines, rather than having all the information in one long paragraph).
Format multiple README files identically. Present the information in the same order, using the same terminology.
Use standardized date formats. Recommended format: W3C/ISO 8601 date standard, which specifies the international standard notation of YYYYMMDD or YYYYMMDDThhmmss.
Follow the scientific conventions for your discipline for taxonomic, geospatial and geologic names, and keywords. Whenever possible, use terms from standardized taxonomies and vocabularies.
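
The recommended date notations can be produced directly from a timestamp. The sketch below uses only the Python standard library; the function names are hypothetical, and converting to UTC before formatting is an assumption that follows the GMT recommendation given for collection dates later in this document.

```python
from datetime import datetime, timezone


def _to_utc(moment):
    """Convert a timezone-aware datetime to UTC; reject naive values."""
    if moment.tzinfo is None:
        raise ValueError("a timezone-aware datetime is required")
    return moment.astimezone(timezone.utc)


def iso_basic_date(moment):
    """Format as YYYYMMDD (ISO 8601 basic date) in UTC."""
    return _to_utc(moment).strftime("%Y%m%d")


def iso_basic_datetime(moment):
    """Format as YYYYMMDDThhmmss (ISO 8601 basic date and time) in UTC."""
    return _to_utc(moment).strftime("%Y%m%dT%H%M%S")
```

Calling iso_basic_date(datetime.now(timezone.utc)) gives a value suitable for a README entry or a file name.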

Description

README File Contents — Description (Minimal recommendations in bold)

Provide a title for the dataset
Name/institution/address/email information for
1. Principal investigator (or person responsible for collecting the data)
2. Associate or co-investigators
3. Contact person for questions
Date of data collection (can be a single date, or a range)
1. GMT Standard (not local time zone)
Information about geographic location of data collection
Keywords used to describe the data topic
Language information
Information about funding sources that supported the collection of the data

File Contents — Data and File Overview (Minimal recommendations in bold)

For each filename, a short description of what data it contains
Format of the file if not obvious from the file name
1. See Open and Proprietary Encoded Data Requirements below
If the data set includes multiple files that relate to one another, the relationship between the files or a description of the file structure that holds them (possible terminology might include "dataset" or "study" or "data package")
Date that the file was created
Date(s) that the file(s) was updated (versioned) and the nature of the update(s), if applicable
Information about related data collected but that is not in the described dataset

README File Contents — Sharing and Accessing Information (Minimal recommendations in bold)

Restrictions placed on the data
Links to publications that cite or use the data
Links to other publicly accessible locations of the data
Recommended citation for the data, especially if some parts of the data persist in a data service other than the LSDA

README File Contents — Methodology (Minimal recommendations in bold)

Description of methods for data collection or generation (include links or references to publications or other documentation containing experimental design or protocols used)
Description of methods used for data processing (describe how the data were generated from the raw or collected data)
Any instrument-specific information needed to understand or interpret the data
Standards and calibration information, if appropriate
Describe any quality-assurance procedures performed on the data
Definitions of codes or symbols used to note or characterize low quality, questionable info or outliers people should be aware of
Point(s) of contact for sample collection, processing, analysis and/or submission

README File Contents — Data Specific Information (Minimal recommendations in bold)

Count of number of variables, and number of cases or rows
Variable list, including full names and definitions of column headings for tabular data (and spell out abbreviated words)
Units of measurement
Definitions for codes or symbols used to record missing data
Specialized formats or other abbreviations used

README File Contents — Examples of Metadata Standards: Table 1

Source	Content
BioPortal	Biomedical ontologies – comprehensive resource for molecular, biological, and medical ontologies
Integrated Taxonomic Information System	taxonomic information on plants, animals, fungi, microbes
NASA Thesaurus	engineering, physics, astronomy, astrophysics, planetary science, Earth sciences, biological sciences
GCMD Keywords	Earth & climate sciences, instruments, sensors, services, data centers, etc.
USGS Thesaurus	agriculture, forest, fisheries, Earth sciences, life sciences, engineering, planetary sciences, social sciences etc.
Getty Research Institute Vocabularies	geographic names, art & architecture, cultural objects, artist names

Investigation-Study-Assay Metadata

LSDA is transitioning to an Investigation-Study-Assay (ISA) framework for metadata which will provide detailed context and descriptions of experiments within the archive. These metadata enhance data discoverability within the archive database, and ensure the data can be efficiently exchanged and integrated for use by future research.² Specifications for submitting data in compliance with ISA documentation will be forthcoming.

For now, please see the ISA documentation for an introduction to the topic: https://isa-specs.readthedocs.io/en/latest/isamodel.html

Data File Formatting Guidelines

This section will address the requirements and recommendations for the structure and description of the two most common forms of data:

User created columnar / tabular data.
1. Spreadsheets and text data files.
Encoded data.
1. Unique file format based on a specification.
2. Commonly found in images and vendor specific file formats.

Columnar/Tabular Data

The most common form of a data file is a tabular data frame, like a spreadsheet. The table form consists of rows and columns of information. Each column contains information regarding a specific attribute, and each row consists of various data related to an entity or observation. Each column should have a header that describes its data and should represent a specific variable with a well-described information type and structure. At a minimum, the rows should have a unique index, made up of one or more columns.

In general, data files should be machine-readable first and human-readable second. The data should be developed for pulling the information into analytical systems (such as Jupyter Notebooks, Matlab, etc.) and not for presentation purposes.

Examples: tab or comma delimited text files, Microsoft Office Excel spreadsheets, SAS/STAT data, etc.

Columnar/Tabular Data Requirements

Columnar/Tabular Data Recommendations

Adherence to Tidy Data guidelines for data structure is recommended, in which:
1. Columns represent variables
2. Rows represent observations
3. Tables consist of observational units
Individual text files are preferred over Excel Workbooks.
1. Text files simplify automated processing of data files and reduce errors introduced by Excel encoding⁴
Tab-delimited columns are preferred over comma-delimited columns, as commas are often included within single data fields.
Multiple data files of single data sets are better than large single files representing many data sets.
Descriptive document header information may be included within the file, before the start of the data, which is common in open data specifications.
1. Document header information consists of instructions, metadata, and/or annotations to inform the use of the data.
2. The term “document header” implies it occurs prior to the data. All comments and annotations should be placed before any data and not be intermixed with data rows.
3. Document header information should be identified by a non-data character as the first character in the line (e.g. “#” or “##” to indicate that the line represents a comment or metadata).
  1. Example:
    # This is a comment line and may include non-data information
    # Comment lines must occur before the first line of data
  2. If metadata elements, column descriptors, or other reference terms are included in the document header, these elements should adhere to a standard format to automate information extraction.
  3. Data annotations may be included, such as conditions on certain information, file specification links, or other information pertinent to the data file.
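
A tab-delimited file that follows the recommendations above can be read with a few lines of code, which is the practical benefit of keeping all comments ahead of the data. The sketch below is illustrative only: the function name read_annotated_tsv and the sample column names are hypothetical, and treating a comment line that appears after the data as an error is an assumption based on the document header rule.

```python
import csv
import io

# Hypothetical file contents: two document header lines, then tabular data.
SAMPLE = (
    "# units: heart_rate in beats per minute\n"
    "# missing values are recorded as NA\n"
    "subject_id\tsession\theart_rate\n"
    "A\t1\t62\n"
    "A\t2\tNA\n"
)


def read_annotated_tsv(text):
    """Split text into (header_comments, column_names, rows).

    Lines starting with '#' are document header lines and must come
    before the data; each data row is returned as a dict keyed by column.
    """
    comments = []
    data_lines = []
    for line in text.splitlines():
        if line.startswith("#"):
            if data_lines:
                raise ValueError("comment line found after the start of the data")
            comments.append(line.lstrip("#").strip())
        elif line.strip():
            data_lines.append(line)
    if not data_lines:
        raise ValueError("no data rows found")
    reader = csv.reader(io.StringIO("\n".join(data_lines)), delimiter="\t")
    column_names = next(reader)
    rows = [dict(zip(column_names, record)) for record in reader]
    return comments, column_names, rows
```

Calling read_annotated_tsv(SAMPLE) returns the two comment strings, the three column names, and one dictionary per data row. In this sample each row is one observation and each column one variable, and subject_id together with session forms the unique row index.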

Encoded Data (Open or Proprietary)

Encoded files consist of data that adhere to a specific encoding and format that must be read using a specific software or must be decoded using a file format specification. Many reasons exist to encode a file. Most software vendors employ unique file encodings to ensure compatibility with and use of their software. Additionally, data may be encoded for the purpose of addressing specific data needs, obscuring information, and reducing file sizes, to name a few.

Image files represent a common type of file based on a file encoding. In order to use encoded data, a data specification must be published so that consumers of the data will know how to decode the data. In the case of common files such as images, file reader tools should be available so that data users are not required to develop their own software. In the case of text encoded data, these data should adhere to common standards, such as base64 encoding, which can easily be handled within most programming languages.
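
As an illustration of how easily a common text encoding is handled, the following sketch round-trips binary content through base64 using the Python standard library. The function names encode_bytes and decode_text are hypothetical, and strict validation on decoding is a design choice made here, not a stated requirement.

```python
import base64


def encode_bytes(raw):
    """Encode binary content as base64 text (ASCII characters only)."""
    return base64.b64encode(raw).decode("ascii")


def decode_text(encoded):
    """Decode base64 text back to bytes, rejecting non-base64 characters."""
    return base64.b64decode(encoded, validate=True)
```

Because both directions are available in most programming languages, a data consumer needs only to be told in the README file that the field is base64 encoded.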

Examples of commonly used file specifications:

media files (image, audio, video)

Microsoft Office files (.doc, .xls[x], .ppt)

MatLab and SAS/STSS

executable files

compressed files

genome sequence BAM files

Mass Spec RAW files

etc.

Open and Proprietary Encoded Data Requirements

Encoded data files must adhere to an open file standard, except for the following cases:
1. The non-open standard is very commonly used by the research community, and tools are readily available to read or transform the data.
  1. Example – MS Office documents, standard image formats, etc.
2. There are no open alternatives to the represented data.
A description of each type of encoded data must be provided, whether in a README file or other annotation file.
1. A separate descriptive document must be provided, even if annotations are included within the document header.
2. For data descriptions provided within a README file:
  1. An external file specification may be referenced, preferably within the README file.
  2. For multiple data types and data specific file encodings, each data type must be described in an annotation file, such as a README file.
Attributable data, such as Personally Identifiable Information or Personal Health Information, MUST be removed from or obfuscated within the encoded file.
1. Common examples are medical images (DICOM, pathology) and DNA/RNA BAM files.
Encoded files should reside within their own directory folder, separate from text files.

Open and Proprietary Encoded Data Recommendations

Encoded files should reside within their own directory folder, separate from text files.
Data specific README files are encouraged.
A description of relevant field tags encoded within each data type is very helpful to data consumers.

Data Integrity and Quality

MD5 Checksums

Data integrity is of utmost concern to the NASA Life Sciences Data Archive. The first line of data integrity verification is based on a handshake between the data submitters and the LSDA team. In order to ensure research data are accurately transmitted upon submission to the LSDA, verification of a cryptographic hash will be employed. Specifically, the LSDA will require those submitting data to include an MD5 (128 bit) hash for each file. The specification for the Message-Digest Algorithm 5 (MD5) can be found here: https://tools.ietf.org/html/rfc1321.

An MD5 Checksum is required for each data and metadata file, but may also be provided for every file, including README and annotation files (see MD5 Checksum Section). It is common to include a single md5sum file in the top level directory, but multi-part md5sum files may be provided for each sub-directory. Below is a common example of the contents of an md5sum file.

Example contents of md5sum file:
md5sum file_location
957c168884ccc1dbfb0e1028ffd1e53e /project/assay/raw/missionX_subjectA_assay.tar
af1b39494c2c66a21add7e80a3c5d7d3 /project/assay/raw/missionX_subjectB_assay.tar
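
An md5sum file in the layout shown above can be produced and checked with the Python standard library. This is a minimal sketch under stated assumptions, not the LSDA verification procedure: the function names are hypothetical, the header line and the two-column layout follow the example above, and a tab is assumed as the column separator when writing (the reader accepts any whitespace).

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_md5sum_file(paths, out_path):
    """Write one 'md5sum file_location' line per file, after a header line."""
    with open(out_path, "w", encoding="utf-8", newline="") as fh:
        fh.write("md5sum\tfile_location\n")
        for path in paths:
            fh.write(f"{md5_of_file(path)}\t{path}\n")


def verify_md5sum_file(md5_path):
    """Return the file locations whose current MD5 differs from the listed one."""
    mismatches = []
    with open(md5_path, encoding="utf-8") as fh:
        for line in fh:
            parts = line.split(None, 1)
            if len(parts) != 2 or parts[0] == "md5sum":
                continue  # skip blank lines and the header line
            expected, location = parts[0].lower(), parts[1].strip()
            if md5_of_file(location) != expected:
                mismatches.append(location)
    return mismatches
```

Reading in chunks keeps memory use flat for multi-gigabyte archives such as the .tar files in the example. An empty list from verify_md5sum_file means every listed file still matches its recorded hash.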

Encryption

DO NOT encrypt files directly. Encrypting specific files requires the use of passcodes, encryption keys, or certificates. The LSDA cannot accept and manage passwords and keys for individual files. Encrypted data will be automatically rejected and the submitter will be required to submit data that is not encrypted. Encryption of sensitive data shall be accomplished by transmitting the data using a secure method (e.g. SSL, HTTPS, or SFTP) and by relying upon the encryption capability built into the computer’s operating system or underlying hardware.

Data Submission Guidelines

Guidelines

File Organization

Metadata and Readme Files

File Formatting

Data Integrity and Quality

File Organization

Organize Files

Primary Level Directory Folder

Secondary/Tertiary Level Folder

Additional Files

File Naming Convention

File Naming Best Practices:

Elements to consider in the naming convention:

File Naming Examples: (Just examples, specific convention up to the investigator)

Metadata Guidelines

Best Practices

Description

Investigation-Study-Assay Metadata

Data File Formatting Guidelines

Columnar/Tabular Data

Columnar/Tabular Data Requirements

Columnar/Tabular Data Recommendations

Encoded Data (Open or Proprietary)

Open and Proprietary Encoded Data Requirements

Data Integrity and Quality

MD5 Checksums

Encryption
