Next Gen LSDA
The directory structure of your submittal can significantly contribute to understanding the data package and efficient consumption of the data. File organization should move from general information in the top directories to increasingly specific information in the sub-directories. This section outlines a number of recommendations to support the development of a well-formed data package.
Organize files into a logical directory structure, and not as a loose set of files within a single folder.
The primary level directory folder should contain:
file_location | size_bytes |
---|---|
/project/assay/raw/missionX_subjectA_assay.tar | 3139235840 |
/project/assay/raw/missionX_subjectB_assay.tar | 2713293944 |
md5sum | file_location |
---|---|
957c168884ccc1dbfb0e1028ffd1e53e | /project/assay/raw/missionX_subjectA_assay.tar |
af1b39494c2c66a21add7e80a3c5d7d3 | /project/assay/raw/missionX_subjectB_assay.tar |
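A file manifest like the one tabulated above can be produced with a short script. The following is a minimal sketch, not an LSDA-provided tool: the function names `build_manifest` and `format_manifest` are hypothetical, and it assumes that paths are recorded relative to the package root with forward slashes.

```python
import os


def build_manifest(root):
    """Return sorted (file_location, size_bytes) pairs for every file under root."""
    rows = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Assumption: locations are stored relative to the package root,
            # using forward slashes regardless of operating system.
            location = os.path.relpath(path, root).replace(os.sep, "/")
            rows.append((location, os.path.getsize(path)))
    return sorted(rows)


def format_manifest(rows):
    """Render manifest rows in the two-column layout shown above."""
    lines = ["file_location | size_bytes |", "---|---|"]
    lines += [f"{location} | {size} |" for location, size in rows]
    return "\n".join(lines)
```

Calling `print(format_manifest(build_manifest("path/to/package")))` would list every file in the package with its size in bytes; the path shown is a placeholder.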
Secondary/tertiary level directory folders should contain:
Additional files – such as summary findings, reports and statistics – may be included within the primary or secondary level directory folders, but not intermixed with data files. A sample directory structure is provided in Figure 1:
Descriptive file names are an important aspect of organizing, sharing, and managing data files. Develop a naming convention based on elements important to the project. Additionally, do not reuse simple file names multiple times in different directories. Ideally, use file names that are unique to the data set. This section builds on the guidelines of the Princeton University Library best practices for file naming. The most important aspects to remember about file naming are to be consistent and descriptive in naming and organizing your files so that it’s obvious where to find a file and what it contains.
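The advice against reusing simple file names in different directories can be checked before submission. The sketch below is illustrative only: `find_reused_names` is a hypothetical helper, and it treats two files as a reuse only when their names match exactly, including case and extension.

```python
import os
from collections import defaultdict


def find_reused_names(root):
    """Map each file name that appears in more than one directory to those directories."""
    seen = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        # Directories are reported relative to the package root.
        directory = os.path.relpath(dirpath, root).replace(os.sep, "/")
        for name in filenames:
            seen[name].append(directory)
    return {name: sorted(dirs) for name, dirs in seen.items() if len(dirs) > 1}
```

An empty result means every file name in the package is unique. A non-empty result lists the names worth renaming, for example by adding mission or subject identifiers.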
README File Contents — Description (Minimal recommendations in bold)
README File Contents — Data and File Overview (Minimal recommendations in bold)
README File Contents — Sharing and Accessing Information (Minimal recommendations in bold)
README File Contents — Methodology (Minimal recommendations in bold)
README File Contents — Data Specific Information (Minimal recommendations in bold)
README File Contents — Examples of Metadata Standards: Table 1
Source | Content |
---|---|
BioPortal | Biomedical ontologies – comprehensive resource for molecular, biological, and medical ontologies |
Integrated Taxonomic Information System | taxonomic information on plants, animals, fungi, microbes |
NASA Thesaurus | engineering, physics, astronomy, astrophysics, planetary science, Earth sciences, biological sciences |
GCMD Keywords | Earth & climate sciences, instruments, sensors, services, data centers, etc. |
USGS Thesaurus | agriculture, forest, fisheries, Earth sciences, life sciences, engineering, planetary sciences, social sciences etc. |
Getty Research Institute Vocabularies | geographic names, art & architecture, cultural objects, artist names |
LSDA is transitioning to an Investigation-Study-Assay (ISA) framework for metadata, which will provide detailed context and descriptions of experiments within the archive. These metadata enhance data discoverability within the archive database and ensure the data can be efficiently exchanged and integrated for use by future research. Specifications for submitting data in compliance with ISA documentation will be forthcoming.
For now, please see the ISA documentation for an introduction to the topic: https://isa-specs.readthedocs.io/en/latest/isamodel.html
This section will address the requirements and recommendations for the structure and description of the two most common forms of data:
The most common form of a data file is a tabular data frame, like a spreadsheet.
The table form consists of rows and columns of information.
Each column contains information regarding a specific attribute, and each row consists
of various data related to an entity or observation.
The columns have headers that describe the data, and each column should represent a specific variable
with a well-described information type and structure.
At a minimum, the rows should have a unique index, made up of one or more columns.
In general, data files should be machine-readable first and human-readable second.
The data should be developed for pulling the information into analytical systems
(such as Jupyter Notebooks, Matlab, etc.) and not for presentation purposes.
Examples: tab or comma delimited text files, Microsoft Office Excel spreadsheets, SAS/STAT data, etc.
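The requirement that rows carry a unique index made up of one or more columns can be tested mechanically. This is a minimal sketch using only the Python standard library. The column names `subject_id`, `session`, and `heart_rate` and the sample values are invented for illustration and do not come from an LSDA data set.

```python
import csv
import io
from collections import Counter

# Hypothetical sample table: the (subject_id, session) pair is meant to be the index.
SAMPLE_CSV = (
    "subject_id,session,heart_rate\n"
    "A,1,61\n"
    "A,2,64\n"
    "B,1,70\n"
    "A,2,66\n"
)


def duplicate_index_values(csv_text, index_columns):
    """Return index tuples that occur on more than one row of the table."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(tuple(row[col] for col in index_columns) for row in reader)
    return sorted(key for key, n in counts.items() if n > 1)
```

In the sample, `duplicate_index_values(SAMPLE_CSV, ["subject_id", "session"])` reports the repeated pair for subject A, session 2, so those two columns alone would not form a valid index for this table.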
Encoded files consist of data that adhere to a specific encoding and format that must be read
using specific software or decoded using a file format specification.
Many reasons exist to encode a file. Most software vendors employ unique file encodings
to ensure compatibility with and use of their software.
Additionally, data may be encoded for the purpose of addressing specific data needs,
obscuring information, and reducing file sizes, to name a few.
Image files represent a common type of file based on a file encoding.
In order to use encoded data, a data specification must be published so that
consumers of the data will know how to decode the data. In the case of common
files such as images, file reader tools should be available so that data users
are not required to develop their own software. In the case of text encoded data,
these data should adhere to common standards, such as base64 encoding, which can
easily be handled within most programming languages.
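As an illustration of how readily base64 is handled, the sketch below round-trips a few arbitrary bytes through Python's standard `base64` module. The helper names `encode_for_text` and `decode_from_text` are hypothetical, and the sample bytes are made up.

```python
import base64


def encode_for_text(raw: bytes) -> str:
    """Encode binary data as a base64 ASCII string suitable for a text file."""
    return base64.b64encode(raw).decode("ascii")


def decode_from_text(text: str) -> bytes:
    """Recover the original bytes from a base64 string."""
    return base64.b64decode(text)


# Arbitrary binary payload used only to demonstrate the round trip.
payload = bytes([0x00, 0xFF, 0x10, 0x80])
encoded = encode_for_text(payload)
restored = decode_from_text(encoded)
```

Because the encoding is standardized, a consumer needs no custom reader: any language with a base64 library can recover `payload` from `encoded`.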
Examples of commonly used file specifications:
Open and Proprietary Encoded Data Requirements
Data integrity is of utmost concern to the NASA Life Sciences Data Archive.
The first line of data integrity verification is based on a handshake between the data submitters and the LSDA team.
In order to ensure research data are accurately transmitted upon submission to the LSDA, verification of a cryptographic hash will be employed.
Specifically, the LSDA will require those submitting data to include an MD5 (128-bit) hash for each file.
The specification for the Message-Digest Algorithm 5 (MD5) can be found here: https://tools.ietf.org/html/rfc1321.
An MD5 checksum is required for each data and metadata file, and may also be provided for every other file,
including README and annotation files (see MD5 Checksum Section).
It is common to include a single md5sum file in the top level directory, but multi-part md5sum files may be provided for each sub-directory.
Below is a common example of the contents of an md5sum file.
Example contents of md5sum file:
md5sum | file_location |
---|---|
957c168884ccc1dbfb0e1028ffd1e53e | /project/assay/raw/missionX_subjectA_assay.tar |
af1b39494c2c66a21add7e80a3c5d7d3 | /project/assay/raw/missionX_subjectB_assay.tar |
DO NOT encrypt files directly. Encrypting specific files requires the use of passcodes, encryption keys, or certificates. The LSDA cannot accept and manage passwords and keys for individual files. Encrypted data will be automatically rejected and the submitter will be required to resubmit the data unencrypted. Sensitive data shall instead be protected by transmitting it using a secure method (e.g. SSL, HTTPS, SFTP) and by relying upon the encryption capability built into the computer’s operating system or underlying hardware.