
Analysis flow#

Here, we’ll track typical data transformations like subsetting that occur during analysis.

If exploring more generally, read this first: Project flow.

Setup#

# a lamindb instance containing Bionty schema
!lamin init --storage ./analysis-usecase --schema bionty
💡 connected lamindb: testuser1/analysis-usecase
import lamindb as ln
import bionty as bt
from lamin_utils import logger

bt.settings.auto_save_parents = False
💡 connected lamindb: testuser1/analysis-usecase

Register an initial dataset#

Here we register an initial artifact with a pipeline script.

# register_example_file.py


def register_example_file():
    # create a pipeline transform to track the registration of the artifact
    transform = ln.Transform(
        name="register example artifact", type="pipeline", version="0.0.1"
    )
    ln.track(transform=transform)

    # an example dataset that has a few cell type, tissue and disease annotations
    adata = ln.core.datasets.anndata_with_obs()

    # validate and register features
    genes = bt.Gene.from_values(
        adata.var_names,
        bt.Gene.ensembl_gene_id,
        organism="human",
    )
    ln.save(genes)
    obs_features = ln.Feature.from_df(adata.obs)
    ln.save(obs_features)

    # validate and register labels
    cell_types = bt.CellType.from_values(adata.obs["cell_type"])
    ln.save(cell_types)
    tissues = bt.Tissue.from_values(adata.obs["tissue"])
    ln.save(tissues)
    diseases = bt.Disease.from_values(adata.obs["disease"])
    ln.save(diseases)

    # register artifact and annotate with features & labels
    artifact = ln.Artifact.from_anndata(
        adata,
        description="anndata with obs"
    )
    artifact.save()
    artifact.features.add_from_anndata(
        var_field=bt.Gene.ensembl_gene_id,
        organism="human",
    )
    features = ln.Feature.lookup()
    artifact.labels.add(cell_types, features.cell_type)
    artifact.labels.add(tissues, features.tissue)
    artifact.labels.add(diseases, features.disease)


register_example_file()
💡 saved: Transform(uid='Fnmq9cMjeL2niohD', name='register example artifact', version='0.0.1', type='pipeline', updated_at=2024-04-22 10:26:59 UTC, created_by_id=1)
💡 saved: Run(uid='U7KFtp87aMxXaVVedqxm', transform_id=1, created_by_id=1)
❗ did not create CellType record for 1 non-validated name: 'my new cell type'

Pull the registered dataset, apply a transformation, and register the result#

Set the current notebook as the new transform:

ln.settings.transform.stem_uid = "eNef4Arw8nNM"
ln.settings.transform.version = "0"
ln.track()
💡 notebook imports: bionty==0.42.9 lamin_utils==0.13.2 lamindb==0.70.3
💡 saved: Transform(uid='eNef4Arw8nNM6K79', name='Analysis flow', key='analysis-flow', version='0', type='notebook', updated_at=2024-04-22 10:27:05 UTC, created_by_id=1)
💡 saved: Run(uid='INVI1mFh2xo2u0qKTqOK', transform_id=2, created_by_id=1)
artifact = ln.Artifact.filter(description="anndata with obs").one()
artifact.describe()
Artifact(uid='EDCL98nOhPBVz4KLp3Ez', suffix='.h5ad', accessor='AnnData', description='anndata with obs', size=46992, hash='IJORtcQUSS11QBqD-nTD0A', hash_type='md5', visibility=1, key_is_virtual=True, updated_at=2024-04-22 10:27:04 UTC)

Provenance:
  📎 storage: Storage(uid='7YJnC0RP', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase', type='local')
  📎 transform: Transform(uid='Fnmq9cMjeL2niohD', name='register example artifact', version='0.0.1', type='pipeline')
  📎 run: Run(uid='U7KFtp87aMxXaVVedqxm', started_at=2024-04-22 10:26:59 UTC, is_consecutive=True)
  📎 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1')
Features:
  var: FeatureSet(uid='N5DR4DzgeCENHodD4FuG', n=99, type='number', registry='bionty.Gene')
    'CD38', 'ABCB5', 'GCLC', 'CASP10', 'LAP3', 'PDK4', 'CALCR', 'TNMD', 'NFYA', 'CREBBP', 'MPO', 'COPZ2', 'DBNDD1', 'AOC1', 'ST7', 'SEMA3F', 'DVL2', 'PRSS22', 'KDM1A', 'CIAPIN1', ...
  obs: FeatureSet(uid='jmRg0XOboRrnTloIo2jn', n=4, registry='core.Feature')
    🔗 cell_type (3, bionty.CellType): 'hepatocyte', 'T cell', 'hematopoietic stem cell'
    cell_type_id (category)
    🔗 tissue (4, bionty.Tissue): 'liver', 'kidney', 'heart', 'brain'
    🔗 disease (4, bionty.Disease): 'cardiac ventricle disorder', 'chronic kidney disease', 'Alzheimer disease', 'liver lymphoma'
Labels:
  📎 tissues (4, bionty.Tissue): 'liver', 'kidney', 'heart', 'brain'
  📎 cell_types (3, bionty.CellType): 'hepatocyte', 'T cell', 'hematopoietic stem cell'
  📎 diseases (4, bionty.Disease): 'cardiac ventricle disorder', 'chronic kidney disease', 'Alzheimer disease', 'liver lymphoma'

Get a backed AnnData object#

adata = artifact.backed()
adata
AnnDataAccessor object with n_obs × n_vars = 40 × 100
  constructed for the AnnData object EDCL98nOhPBVz4KLp3Ez.h5ad
    obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
    var: ['_index']

Subset dataset to specific cell types and diseases#

cell_types = artifact.cell_types.all().lookup(return_field="name")
diseases = artifact.diseases.all().lookup(return_field="name")

Create the subset:

subset_obs = adata.obs.cell_type.isin(
    [cell_types.t_cell, cell_types.hematopoietic_stem_cell]
) & (adata.obs.disease.isin([diseases.liver_lymphoma, diseases.chronic_kidney_disease]))
adata_subset = adata[subset_obs]
adata_subset
AnnDataAccessorSubset object with n_obs × n_vars = 20 × 100
  obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
  var: ['_index']
adata_subset.obs[["cell_type", "disease"]].value_counts()
cell_type                disease               
T cell                   chronic kidney disease    10
hematopoietic stem cell  liver lymphoma            10
dtype: int64
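
The subset expression combines two membership tests with `&`, so a row is kept only if both its cell type and its disease are in the selected lists. Here is a minimal sketch of the same logic over plain Python records, with `collections.Counter` playing the role of `value_counts()`. The four toy rows are made up for illustration and are not the real dataset.

```python
from collections import Counter

# Hypothetical toy observations; the real dataset has 40 rows.
obs = [
    {"cell_type": "T cell", "disease": "chronic kidney disease"},
    {"cell_type": "T cell", "disease": "Alzheimer disease"},
    {"cell_type": "hematopoietic stem cell", "disease": "liver lymphoma"},
    {"cell_type": "hepatocyte", "disease": "liver lymphoma"},
]

keep_cell_types = {"T cell", "hematopoietic stem cell"}
keep_diseases = {"liver lymphoma", "chronic kidney disease"}

# One boolean per row: both conditions must hold (the `&` in the mask above).
mask = [
    row["cell_type"] in keep_cell_types and row["disease"] in keep_diseases
    for row in obs
]
subset = [row for row, keep in zip(obs, mask) if keep]

# Count (cell_type, disease) pairs, analogous to value_counts().
counts = Counter((row["cell_type"], row["disease"]) for row in subset)
for pair, n in counts.items():
    print(pair, n)
```

Note that a row with a selected cell type but an unselected disease (the second toy row) is dropped, which is why only two of the possible cell type and disease combinations survive in the real output as well.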

Register the subsetted AnnData:

file_subset = ln.Artifact.from_anndata(
    adata_subset.to_memory(),
    description="anndata with obs subset"
)
/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/anndata/_core/anndata.py:1820: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
file_subset.save()
file_subset.features.add_from_anndata(
    var_field=bt.Gene.ensembl_gene_id,
    organism="human",  # optionally, globally set organism via bt.settings.organism = "human"
)
features = ln.Feature.lookup()

file_subset.labels.add(adata_subset.obs.cell_type, features.cell_type)
file_subset.labels.add(adata_subset.obs.disease, features.disease)
file_subset.labels.add(adata_subset.obs.tissue, features.tissue)

Examine data flow#

Query a subsetted .h5ad artifact containing “hematopoietic stem cell” and “T cell”:

cell_types = bt.CellType.lookup()
my_subset = ln.Artifact.filter(
    suffix=".h5ad",
    description__endswith="subset",
    cell_types__in=[
        cell_types.hematopoietic_stem_cell,
        cell_types.t_cell,
    ],
).first()
my_subset
Artifact(uid='d4OHaEmhMhrx1nKYWqwe', suffix='.h5ad', accessor='AnnData', description='anndata with obs subset', size=38992, hash='RgGUx7ndRplZZSmalTAWiw', hash_type='md5', visibility=1, key_is_virtual=True, updated_at=2024-04-22 10:27:06 UTC, storage_id=1, transform_id=2, run_id=2, created_by_id=1)
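
The filter above uses keyword arguments of the form `field__operator`, such as `description__endswith`, and all conditions must hold at once. To make that convention concrete, here is a toy matcher over plain dictionaries. It is only a sketch: `OPERATORS` and `matches` are hypothetical names, this is not how LaminDB evaluates queries, and it handles scalar fields only, so it does not model relational lookups like `cell_types__in`.

```python
# Toy matcher for "field__operator" keyword filters over plain dicts.
# Illustrative only; real queries are evaluated by the database.
OPERATORS = {
    "exact": lambda value, target: value == target,
    "endswith": lambda value, target: str(value).endswith(target),
    "in": lambda value, target: value in target,
}


def matches(record: dict, **lookups) -> bool:
    """Return True if the record satisfies every lookup (conditions are AND-ed)."""
    for key, target in lookups.items():
        field, _, op = key.partition("__")
        op = op or "exact"  # a bare field name means equality
        if not OPERATORS[op](record[field], target):
            return False
    return True


records = [
    {"suffix": ".h5ad", "description": "anndata with obs"},
    {"suffix": ".h5ad", "description": "anndata with obs subset"},
]
hits = [
    r for r in records
    if matches(r, suffix=".h5ad", description__endswith="subset")
]
print(hits)
```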

Common questions that might arise are:

  • What is the history of this artifact?

  • Which features and labels are associated with it?

  • Which notebook analyzed and registered this artifact?

  • By whom?

  • And which artifact is its parent?

Let’s answer this using LaminDB:

print("--> What is the history of this artifact?\n")
file_subset.view_lineage()

print("\n\n--> Which features and labels are associated with it?\n")
logger.print(file_subset.features)
logger.print(file_subset.labels)

print("\n\n--> Which notebook analyzed and registered this artifact\n")
logger.print(file_subset.transform)

print("\n\n--> By whom\n")
logger.print(file_subset.created_by)

print("\n\n--> And which artifact is its parent\n")
display(file_subset.run.input_artifacts.df())
--> What is the history of this artifact?
_images/662a12bc8fe2fb05984875945cb7fbe186324d5145e04102d09d7c79502aea61.svg
--> Which features and labels are associated with it?

Features:
  var: FeatureSet(uid='N5DR4DzgeCENHodD4FuG', n=99, type='number', registry='bionty.Gene')
    'CD38', 'ABCB5', 'GCLC', 'CASP10', 'LAP3', 'PDK4', 'CALCR', 'TNMD', 'NFYA', 'CREBBP', 'MPO', 'COPZ2', 'DBNDD1', 'AOC1', 'ST7', 'SEMA3F', 'DVL2', 'PRSS22', 'KDM1A', 'CIAPIN1', ...
  obs: FeatureSet(uid='jmRg0XOboRrnTloIo2jn', n=4, registry='core.Feature')
    🔗 cell_type (2, bionty.CellType): 'T cell', 'hematopoietic stem cell'
    cell_type_id (category)
    🔗 tissue (2, bionty.Tissue): 'liver', 'kidney'
    🔗 disease (2, bionty.Disease): 'chronic kidney disease', 'liver lymphoma'
Labels:
  📎 tissues (2, bionty.Tissue): 'liver', 'kidney'
  📎 cell_types (2, bionty.CellType): 'T cell', 'hematopoietic stem cell'
  📎 diseases (2, bionty.Disease): 'chronic kidney disease', 'liver lymphoma'
--> Which notebook analyzed and registered this artifact

Transform(uid='eNef4Arw8nNM6K79', name='Analysis flow', key='analysis-flow', version='0', type='notebook', updated_at=2024-04-22 10:27:05 UTC, created_by_id=1)
--> By whom

User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2024-04-22 10:26:57 UTC)
--> And which artifact is its parent
id                1
uid               EDCL98nOhPBVz4KLp3Ez
storage_id        1
key               None
suffix            .h5ad
accessor          AnnData
description       anndata with obs
version           None
size              46992
hash              IJORtcQUSS11QBqD-nTD0A
hash_type         md5
n_objects         None
n_observations    None
transform_id      1
run_id            1
visibility        1
key_is_virtual    True
created_at        2024-04-22 10:27:04.805626+00:00
updated_at        2024-04-22 10:27:04.900459+00:00
created_by_id     1
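
The answers above all follow from a few links: each artifact points to the run that created it, each run points to its transform, and a run records the artifacts it took as input. The sketch below models those links with toy dataclasses and walks them backwards. The class names mirror the LaminDB records for readability, but these are hypothetical stand-ins, not the library's classes, and the helper `lineage` is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Transform:
    name: str
    type: str


@dataclass
class Run:
    transform: Transform
    input_artifacts: list = field(default_factory=list)


@dataclass
class Artifact:
    description: str
    run: Run  # the run that created this artifact


def lineage(artifact):
    """Walk back through input artifacts, yielding (description, transform name)."""
    stack = [artifact]
    seen = set()
    while stack:
        current = stack.pop()
        if id(current) in seen:  # guard against visiting an artifact twice
            continue
        seen.add(id(current))
        yield current.description, current.run.transform.name
        stack.extend(current.run.input_artifacts)


# The pipeline run registers the initial artifact.
pipeline_run = Run(Transform("register example artifact", "pipeline"))
parent = Artifact("anndata with obs", pipeline_run)

# The notebook run reads the parent and registers the subset.
notebook_run = Run(Transform("Analysis flow", "notebook"), input_artifacts=[parent])
child = Artifact("anndata with obs subset", notebook_run)

history = list(lineage(child))
for description, transform_name in history:
    print(f"{description!r} was created by {transform_name!r}")
```

In this toy model, "which artifact is its parent" is simply `child.run.input_artifacts`, which mirrors the `file_subset.run.input_artifacts` lookup used in the notebook.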
!lamin delete --force analysis-usecase
!rm -r ./analysis-usecase
💡 deleting instance testuser1/analysis-usecase
❗ manually delete your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase