IMPORTANT: Life Sciences Data Archive (LSDA) is transitioning to a new home. During this transition period, beginning July 18, 2022,
LSDA users will be able to access existing information but will not be able to enter data requests.
After the transition, users will be automatically redirected to the new system, the NASA Life Science Portal (NLSP). Users will need to update any bookmarks they have created.
Life Science Data Archive Data Submission Guidelines
This document provides data submission requirements and recommendations on overall information organization,
file naming, and file structures for investigators archiving data with NASA’s Life Sciences Data Archive (LSDA).
The guidelines have been developed to enable consistent, high-quality, computable data.
While some guidelines in this document are not required, they are highly recommended to ensure data from NASA-funded studies are
findable, accessible, interoperable, and reusable. If you have questions or require assistance complying with the guidelines below,
please contact an LSDA archivist by filling out the form here.
The directory structure of your submittal can significantly contribute to understanding the data package
and efficient consumption of the data. File organization should move from general information in the top
directories to increasingly specific information in the sub-directories. This section outlines a number
of recommendations to support the development of a well-formed data package.
Include a data manifest of all the files in the complete data package as a reference to ensure successful data submission.
The data submission should contain a single manifest of all files in the submission, listing the relative
path of each file, starting from the main data directory. Do not include the full local file path.
The data manifest should contain two columns separated by a tab:
Relative file path and name, relative to the main folder
File size in bytes
The manifest may be generated using simple scripts once the file organization is complete,
and should include the relative file path of the data directory. Do not include names of local directories
that will not be part of the submission package. Below are script examples of how to generate the
manifest, based on operating system, to help the data submitter get started.
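As an operating-system-independent starting point, the manifest described above can be sketched in Python. This is a minimal sketch, not an LSDA-provided script: the function names and the output file name are hypothetical. It also assumes that paths are written relative to the main data directory without the directory's own name and with forward slashes.

```python
import os


def build_manifest(data_dir):
    """Return sorted (relative_path, size_in_bytes) pairs for every file under data_dir."""
    entries = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            full_path = os.path.join(root, name)
            # Keep the path relative to the main data directory, never the local absolute path.
            rel_path = os.path.relpath(full_path, data_dir).replace(os.sep, "/")
            entries.append((rel_path, os.path.getsize(full_path)))
    return sorted(entries)


def write_manifest(data_dir, manifest_path):
    """Write one line per file: relative path, a tab, then the file size in bytes."""
    entries = build_manifest(data_dir)
    with open(manifest_path, "w", encoding="utf-8", newline="\n") as handle:
        for rel_path, size in entries:
            handle.write(f"{rel_path}\t{size}\n")
    return entries
```

A call such as write_manifest("my_submission", "manifest.tsv") (both names hypothetical) would produce the two-column, tab-separated file. If the manifest is written inside the data directory, a later run will list the earlier manifest as well, so it may be simpler to write it elsewhere and copy it in last.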
An MD5 Checksum is required for each data and metadata file, but may also be provided for every file,
including README and annotation files (see MD5 Checksum Section). It is common to include a single md5sum
file in the top level directory, but multi-part md5sum files may be provided for each sub-directory.
Below is a common example of the contents of an md5sum file.
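The widely used md5sum convention writes one line per file: the 32-character hexadecimal digest, two spaces, then the file path. The sketch below produces lines in that layout for every file under a directory. The function names are hypothetical, and the use of relative paths with forward slashes is an assumption made to match the manifest guidance rather than a stated LSDA requirement.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5sum_lines(data_dir):
    """Return md5sum-style lines ("<digest>  <relative path>") sorted by path."""
    lines = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            full_path = os.path.join(root, name)
            rel_path = os.path.relpath(full_path, data_dir).replace(os.sep, "/")
            lines.append(f"{md5_of_file(full_path)}  {rel_path}")
    # Sort on the path portion so the listing is stable between runs.
    return sorted(lines, key=lambda line: line.split("  ", 1)[1])
```

Writing the returned lines to a single text file in the top-level directory gives the single-file layout described above; calling the function on each sub-directory gives the multi-part variant.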
Secondary/tertiary level directory folders should contain:
Files representing different levels of abstraction grouped within their own folder
Folders with data should only contain files of the same type. Do not mix different file types within the same folder
(e.g. all files related to an RNASeq assay should be contained within a single directory structure).
Nested folders of related data should contain README files to more completely describe the structure of each file type.
Data integrity check (MD5 checksum) files may be included at each directory level
Additional files – such as summary findings, reports and statistics – may be included within the primary
or secondary level directory folders, but not mixed with data files.
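Before packaging, a quick scan can flag folders that mix file types. The sketch below is one possible check, under two loudly stated assumptions: a file's "type" is approximated by its extension, and files whose names start with "readme" or "md5sum" are ignored because they may accompany data at any level. The function name and both assumptions are illustrative, not LSDA rules.

```python
import os


def folders_with_mixed_types(data_dir, ignore_prefixes=("readme", "md5sum")):
    """Map each folder (relative path) to its sorted extensions when it holds more than one."""
    mixed = {}
    for root, _dirs, files in os.walk(data_dir):
        extensions = set()
        for name in files:
            # Assumption: README and checksum files are allowed next to data files.
            if name.lower().startswith(ignore_prefixes):
                continue
            extensions.add(os.path.splitext(name)[1].lower())
        if len(extensions) > 1:
            rel_folder = os.path.relpath(root, data_dir).replace(os.sep, "/")
            mixed[rel_folder] = sorted(extensions)
    return mixed
```

An empty result means no folder was flagged under these assumptions. A flagged folder is only a prompt for review, since some assays legitimately produce companion files with different extensions.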
A sample directory structure is provided in Figure 1:
Descriptive file names are an important aspect of organizing, sharing, and managing data files.
Develop a naming convention based on elements important to the project.
Additionally, do not reuse simple file names multiple times in different directories.
Ideally, use file names that are unique to the data set.
This section builds on the guidelines of the Princeton University Library best practices for file naming.
The most important aspects to remember about file naming are to be consistent and descriptive in naming
and organizing your files so that it’s obvious where to find a file and what it contains.
File Naming Best Practices:
Files should be named consistently.
File names should be descriptive and not too long (<64 characters)
Do not use special characters or spaces in file names.
Use capital letters and underscores (‘_’) to separate words instead.
Use the ISO 8601 date format: YYYYMMDD.
Include a version number, where appropriate.
Document the naming convention in the README file.
Elements to consider in the naming convention:
Date of creation
Location or institution
Project or mission name
Sample name or type
File Naming Examples (illustrative only; the specific convention is up to the investigator):
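One way to apply the best practices above consistently is to build names from the chosen elements and check them programmatically. The following is a sketch only. The element order, the "v01" version style, and the project and sample values used in the usage note are hypothetical, and the check assumes that any character other than a letter, digit, underscore, hyphen, or period counts as "special".

```python
import re
from datetime import date

# Assumption: only letters, digits, underscores, hyphens, and periods are allowed.
ALLOWED = re.compile(r"[A-Za-z0-9_.\-]+")


def build_file_name(project, sample, created, version, extension):
    """Join naming elements with underscores: project, sample, date, version."""
    stamp = created.strftime("%Y%m%d")  # ISO 8601 basic date, YYYYMMDD
    return f"{project}_{sample}_{stamp}_v{version:02d}.{extension}"


def file_name_problems(name):
    """Return a list of guideline violations found in a file name (empty if none)."""
    problems = []
    if len(name) >= 64:
        problems.append("name is 64 characters or longer")
    if " " in name:
        problems.append("name contains spaces")
    # Spaces are reported separately, so remove them before the character check.
    if not ALLOWED.fullmatch(name.replace(" ", "")):
        problems.append("name contains special characters")
    return problems
```

For example, build_file_name("PROJX", "LIVER", date(2022, 7, 18), 1, "csv") gives PROJX_LIVER_20220718_v01.csv, which passes file_name_problems with no violations.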
Create one README file for each data type whenever possible. It is also appropriate to describe a "dataset" that has multiple, related, identically formatted files, or files that are logically grouped together for use (e.g. a collection of Matlab scripts). When appropriate, also describe the file structure that holds the related data files.
Name the README file so that it is easily associated with the data file(s) it describes.
Write your README document as a plain text file, avoiding proprietary formats (such as MS Word) whenever possible. Format the README document so it is easy to understand (e.g. separate important pieces of information with blank lines, rather than having all the information in one long paragraph).
Format multiple README files identically. Present the information in the same order, using the same terminology.
Use standardized date formats. Recommended format: W3C/ISO 8601 date standard, which specifies the international standard notation of YYYYMMDD or YYYYMMDDThhmmss.
Follow the scientific conventions for your discipline for taxonomic, geospatial and geologic names, and keywords. Whenever possible, use terms from standardized taxonomies and vocabularies, a few of which are listed in Table 1 below.
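The two recommended date notations can be produced directly from a timestamp. The sketch below is a minimal illustration with hypothetical function names. Converting to UTC before formatting is an assumption, made so that README dates do not depend on the local time zone of the machine that wrote them.

```python
from datetime import datetime, timezone


def iso_basic_date(moment):
    """Format a timezone-aware datetime as YYYYMMDD after converting it to UTC."""
    return moment.astimezone(timezone.utc).strftime("%Y%m%d")


def iso_basic_datetime(moment):
    """Format a timezone-aware datetime as YYYYMMDDThhmmss after converting it to UTC."""
    return moment.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%S")
```

Pass timezone-aware values, for example datetime.now(timezone.utc). A naive datetime would be interpreted as local time by astimezone, which may not be what is intended.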
README File Contents — Description (Minimal recommendations in bold)
Provide a title for the dataset
Name/institution/address/email information for
Principal investigator (or person responsible for collecting the data)
Associate or co-investigators
Contact person for questions
Date of data collection (can be a single date, or a range)
GMT Standard (not local time zone)
Information about geographic location of data collection
Keywords used to describe the data topic
Information about funding sources that supported the collection of the data
README File Contents — Data and File Overview (Minimal recommendations in bold)
For each filename, a short description of what data it contains
Format of the file if not obvious from the file name
If the data set includes multiple files that relate to one another, the relationship between the files or a description of the file structure that holds them (possible terminology might include "dataset" or "study" or "data package")
Date that the file was created
Date(s) that the file(s) was updated (versioned) and the nature of the update(s), if applicable
Information about related data collected but that is not in the described dataset
README File Contents — Sharing and Accessing Information (Minimal recommendations in bold)
Restrictions placed on the data
Links to publications that cite or use the data
Links to other publicly accessible locations of the data
Recommended citation for the data, especially if some parts of the data persist in a data service other than the LSDA
README File Contents — Methodology (Minimal recommendations in bold)
Description of methods for data collection or generation (include links or references to publications or other documentation containing experimental design or protocols used)
Description of methods used for data processing (describe how the data were generated from the raw or collected data)
Any instrument-specific information needed to understand or interpret the data
Standards and calibration information, if appropriate
Describe any quality-assurance procedures performed on the data
Definitions of codes or symbols used to flag low-quality or questionable information, or outliers that users should be aware of
Point(s) of contact for sample collection, processing, analysis and/or submission
README File Contents — Data Specific Information (Minimal recommendations in bold)
Count of number of variables, and number of cases or rows
Variable list, including full names and definitions of column headings for tabular data (and spell out abbreviated words)
Units of measurement
Definitions for codes or symbols used to record missing data
Specialized formats or other abbreviations used
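Several of the data-specific items above (number of variables, number of rows, variable list, missing-data codes) can be drafted automatically from a delimited text file and then completed by hand. The sketch below assumes a header row, a tab delimiter by default, and that empty cells and "NA" mark missing data. All three are assumptions to adjust for the data set at hand, and the function name is hypothetical.

```python
import csv


def summarize_table(path, delimiter="\t", missing_codes=("", "NA")):
    """Count rows and variables in a delimited text file and tally missing values per column."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        variables = list(reader.fieldnames or [])
        missing = {name: 0 for name in variables}
        row_count = 0
        for row in reader:
            row_count += 1
            for name in variables:
                value = row[name]
                # Short rows yield None for absent cells; count those as missing too.
                if value is None or value.strip() in missing_codes:
                    missing[name] += 1
    return {
        "variables": variables,
        "variable_count": len(variables),
        "row_count": row_count,
        "missing": missing,
    }
```

The returned dictionary supplies the counts and column names for the README. Definitions, units, and the meaning of each code still need to be written by someone who knows the data.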
README File Contents — Examples of Metadata Standards: Table 1
LSDA is transitioning to an Investigation-Study-Assay (ISA) framework for metadata which will provide detailed
context and descriptions of experiments within the archive.
These metadata enhance data discoverability
within the archive database and ensure the data can be efficiently exchanged and integrated for use in future research.
Specifications for submitting data in compliance with ISA documentation will be forthcoming.
This section will address the requirements and recommendations for the structure and description of the two most common forms of data:
User-created columnar / tabular data: spreadsheets and text data files
Unique file formats based on a specification: commonly found in images and vendor-specific file formats
For each form of data, there will be a limited set of Requirements, meaning data that does not adhere to these characteristics
will not be acceptable to the LSDA. There will also be a set of Recommendations, which are highly desired characteristics, but not required.
The most common form of a data file is a tabular data frame, like a spreadsheet.
The table form consists of rows and columns of information.
Each column contains information regarding a specific attribute, and each row consists
of various data related to an entity or observation.
The columns have headers that describe the data in each column, which should represent a specific variable
with a well described information type and structure.
At a minimum, the rows should have a unique index, made up of one or more columns.
In general, data files should be machine-readable first and human-readable second.
The data should be developed for pulling the information into analytical systems
(such as Jupyter Notebooks, Matlab, etc.) and not for presentation purposes.
Examples: tab or comma delimited text files, Microsoft Office Excel spreadsheets, SAS/STAT data, etc.
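The unique-index expectation described above can be checked before submission. The sketch below reads a delimited text file and reports any index value that appears in more than one row, where the index may span several columns. The function name and the column names in the usage note are hypothetical.

```python
import csv
from collections import Counter


def duplicate_keys(path, key_columns, delimiter=","):
    """Return index values (tuples over key_columns) that occur in more than one row."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        # Count each combination of key column values across all rows.
        counts = Counter(tuple(row[column] for column in key_columns) for row in reader)
    return sorted(key for key, count in counts.items() if count > 1)
```

For a file indexed by subject and study day, duplicate_keys("measurements.csv", ["subject", "day"]) returns an empty list when every row is uniquely identified by that pair.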
Encoded files consist of data that adhere to a specific encoding and format, and that must be read
using specific software or decoded using a file format specification.
Many reasons exist to encode a file. Most software vendors employ unique file encodings
to ensure compatibility with and use of their software.
Additionally, data may be encoded for the purpose of addressing specific data needs,
obscuring information, and reducing file sizes, to name a few.
Image files represent a common type of file based on a file encoding.
In order to use encoded data, a data specification must be published so that
consumers of the data will know how to decode the data. In the case of common
files such as images, file reader tools should be available so that data users
are not required to develop their own software. In the case of text encoded data,
these data should adhere to common standards, such as base64 encoding, which can
easily be handled within most programming languages.
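As an illustration of how little code a common text encoding requires, the sketch below round-trips bytes through base64 using Python's standard library. The function names are hypothetical.

```python
import base64


def encode_bytes(raw):
    """Encode raw bytes as base64 ASCII text."""
    return base64.b64encode(raw).decode("ascii")


def decode_text(text):
    """Decode base64 text back to the original bytes, rejecting invalid characters."""
    return base64.b64decode(text, validate=True)
```

Because base64 is a published standard, a data consumer needs no custom decoder: decode_text(encode_bytes(data)) returns the original bytes in any environment with an equivalent library.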
Data integrity is of utmost concern to the NASA Life Sciences Data Archive.
The first line of data integrity verification is based on a handshake between the data submitters and the LSDA team.
In order to ensure research data are accurately transmitted upon submission to the LSDA, verification of a cryptographic hash will be employed.
Specifically, the LSDA will require those submitting data to include an MD5 (128-bit) hash for each file.
The specification for the Message-Digest Algorithm 5 (MD5) can be found here: https://tools.ietf.org/html/rfc1321.
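The verification side of this handshake can be sketched as follows: recompute the MD5 digest of each file named in a checksum listing and compare it with the listed value. This is an illustration only, not the LSDA's actual verification procedure. It assumes the common md5sum line layout (digest, whitespace, path), and the function names are hypothetical.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5sum_file(md5sum_path, data_dir):
    """Return the listed paths that are missing or whose MD5 digest does not match."""
    failures = []
    with open(md5sum_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if not line:
                continue
            expected, rel_path = line.split(None, 1)
            rel_path = rel_path.lstrip("*")  # md5sum marks binary mode with a leading '*'
            full_path = os.path.join(data_dir, rel_path)
            if not os.path.isfile(full_path) or md5_of_file(full_path) != expected.lower():
                failures.append(rel_path)
    return failures
```

An empty list indicates that every listed file is present and unchanged. Running the same check before upload lets a submitter catch a stale checksum file early.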
DO NOT encrypt files directly. Encrypting specific files requires the use of passcodes, encryption keys, or certificates.
The LSDA cannot accept and manage passwords and keys for individual files.
Encrypted data will be automatically rejected, and the submitter will be required to submit data that is not encrypted.
Encryption of sensitive data shall be accomplished by transmitting the data using a secure method (e.g. SSL, HTTPS, SFTP)
and by relying upon the encryption capability built into the computer’s operating system or underlying hardware.