Phenopacket First Dataset Loader

A definition of a dataset directory structure that uses Phenopacket files as the primary index.

File/object structure

Files/objects are expected to be submitted to the dataset in batches, with each batch stored in a folder named after the batch. Batch names should sort alphabetically in order of submission time (most likely an ISO date or datetime, but an incrementing number would also work, provided it is zero-padded so that alphabetical order matches numeric order).

e.g.

my_dataset
  2020-01-03
    file1.bam
    file1.bam.bai
  2020-01-16

  2021-05-03

All batches together make up the overall dataset; that is, batches do not need to be internally self-contained. For instance, a FASTQ pair can span two batches.

Once submitted, a batch is considered immutable (corrections can be made via the mechanisms described below). The data model is accretive.

The content of a batch is currently limited to a single level: no nested directories are allowed within a batch, though this restriction may be lifted in the future.
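
The layout rules above can be sketched in a few lines of Python. This is only an illustration, not the loader's actual implementation: the function names `list_batches` and `nested_directories` are hypothetical, and the sketch assumes the dataset lives on a local filesystem rather than in an object store.

```python
from pathlib import Path


def list_batches(dataset_dir: str) -> list[str]:
    """Return batch folder names in submission order.

    Assumption: batch names sort alphabetically by submission time,
    so a plain string sort gives the order in which to apply them.
    """
    root = Path(dataset_dir)
    return sorted(p.name for p in root.iterdir() if p.is_dir())


def nested_directories(dataset_dir: str, batch: str) -> list[str]:
    """Return the names of any sub-directories inside a batch.

    Batches are currently single level, so a valid batch yields [].
    """
    batch_dir = Path(dataset_dir) / batch
    return sorted(p.name for p in batch_dir.iterdir() if p.is_dir())
```

For the example dataset shown earlier, `list_batches("my_dataset")` would return the three batch names in date order, and `nested_directories` would return an empty list for each of them.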

Checksums

Every submission batch folder MUST have an md5sums.txt consisting of the MD5 checksums of all files in the folder.

md5sum * > md5sums.txt
-or-
md5sum-lite * > md5sums.txt

It is safe for md5sum to include an entry for md5sums.txt itself, as that entry is ignored by the algorithm.

All files in the submission folder MUST be included in the md5sums.txt, and all files listed in the md5sums.txt MUST be present in the folder.
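
As a rough illustration of these checksum rules, the following Python sketch checks one batch folder in both directions and compares digests. It is not the loader's own code: the names `check_batch`, `read_sums` and `md5_of` are hypothetical, and it assumes md5sums.txt uses the GNU md5sum line format (32 hex characters, a space, a mode character, then the file name) with plain, unescaped file names.

```python
import hashlib
from pathlib import Path

SUMS_FILE = "md5sums.txt"


def md5_of(path: Path) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_sums(batch_dir: Path) -> dict[str, str]:
    """Parse md5sums.txt into {file name: digest}, skipping its own entry."""
    sums = {}
    text = (batch_dir / SUMS_FILE).read_text(encoding="utf-8")
    for line in text.splitlines():
        if not line.strip():
            continue
        # Assumed GNU format: <32 hex digits><space><mode char><file name>
        digest, name = line[:32].lower(), line[34:]
        if name != SUMS_FILE:
            sums[name] = digest
    return sums


def check_batch(batch_dir: str) -> list[str]:
    """Return a list of problems found in a batch; empty means it passes."""
    folder = Path(batch_dir)
    listed = read_sums(folder)
    listed_names = set(listed)
    present = {
        p.name for p in folder.iterdir()
        if p.is_file() and p.name != SUMS_FILE
    }
    problems = [f"not listed in {SUMS_FILE}: {n}" for n in sorted(present - listed_names)]
    problems += [f"listed but missing: {n}" for n in sorted(listed_names - present)]
    problems += [
        f"checksum mismatch: {n}"
        for n in sorted(present & listed_names)
        if md5_of(folder / n) != listed[n]
    ]
    return problems
```

Returning a list of problems rather than raising on the first one lets a submitter see every issue in a batch at once.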

Correction of files

Files can be replaced in a later submission by a new file of the same name.

A file can be deleted by creating a later submission with an empty file of the same name.
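
One way to picture the replacement and deletion rules is to walk the batches in order and keep the latest version of each file name. The Python sketch below does this under stated assumptions: `resolve_files` is a hypothetical name, the dataset is on a local filesystem, md5sums.txt is treated as bookkeeping rather than data, and an empty file whose name has not appeared before is simply ignored.

```python
from pathlib import Path


def resolve_files(dataset_dir: str) -> dict[str, Path]:
    """Map each file name to the path of its latest non-deleted version."""
    effective: dict[str, Path] = {}
    root = Path(dataset_dir)
    batches = sorted((p for p in root.iterdir() if p.is_dir()), key=lambda p: p.name)
    for batch in batches:
        for entry in sorted(batch.iterdir(), key=lambda p: p.name):
            if not entry.is_file() or entry.name == "md5sums.txt":
                continue
            if entry.stat().st_size == 0:
                # An empty file in a later batch marks the name as deleted.
                effective.pop(entry.name, None)
            else:
                # A same-named file in a later batch replaces the earlier one.
                effective[entry.name] = entry
    return effective
```

Because earlier batches are never modified, the result can always be recomputed from the batch folders alone.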
