This readme file was generated on 2025-03-08 by Jiqing Wu

GENERAL INFORMATION

Title of Dataset: The Tera-MIND dataset derived from three tera-scale mouse brain atlases

Author/Principal Investigator Information
Name: Viktor H. Koelzer
ORCID: 0000-0001-9206-4885
Institution: University Hospital of Basel
Address: Petersgraben 4, 4031 Basel, Switzerland
Email: Viktor.Koelzer@usb.ch

Author/Associate or Co-investigator Information
Name: Jiqing Wu
ORCID: 0000-0002-6898-8698
Institution: University of Basel
Address: Hegenheimermattweg 167b, 4123 Allschwil, Switzerland
Email: Jiqing.Wu@unibas.ch

Date of data collection: 2024-01-01

Geographic location of data collection: Zurich, Switzerland

Information about funding sources that supported the collection of the data: N/A

SHARING/ACCESS INFORMATION

Licenses/restrictions placed on the data: CC-BY-NC-4.0

Links to publications that cite or use the data: https://doi.org/10.48550/arXiv.2503.01220

Links to other publicly accessible locations of the data: https://doi.org/10.5281/zenodo.14745019

Links/relationships to ancillary data sets: Checkpoints of the Tera-MIND PyTorch models, raw figures of the manuscript plots, whole slide images of the generated and ground-truth mouse brains, etc.

Was data derived from another source? Yes
If yes, list source(s): https://doi.org/10.35077/g.610

Recommended citation for this dataset:

@misc{wu2025teramindterascalemousebrain,
  title={Tera-MIND: Tera-scale mouse brain simulation via spatial mRNA-guided diffusion},
  author={Jiqing Wu and Ingrid Berg and Yawei Li and Ender Konukoglu and Viktor H. Koelzer},
  year={2025},
  eprint={2503.01220},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.01220},
}

DATA & FILE OVERVIEW

File List:

main/gene_638850.zip, main/img_638850.zip: The collection of patch-wise spatial transcriptomic (ST) image data, and DAPI- and PolyT-stained bioimages, which are derived from one P56 female mouse brain atlas for main analysis.

supp_m/gene_609889.zip, supp_m/img_609889.zip: The collection of patch-wise ST image data, and DAPI- and PolyT-stained bioimages, which are derived from one P56 male mouse brain atlas for supporting analysis.

supp_f/gene_609882.zip, supp_f/img_609882.zip: The collection of patch-wise ST image data, and DAPI- and PolyT-stained bioimages, which are derived from one P56 female mouse brain atlas for supporting analysis.

Relationship between files, if important: The data stored in gene_*.zip and img_*.zip are paired and have the same spatial resolution.

Additional related data collected that was not included in the current data package: N/A

Are there multiple versions of the dataset? No
If yes, name of file(s) that was updated:
Why was the file updated?
When was the file updated?

METHODOLOGICAL INFORMATION

Description of methods used for collection/generation of data: QuPath, Fiji, and ABBA software; Python with the Zarr and Sparse packages.

Methods for processing the data: Each of the three datasets is derived from one of three mouse brain atlases (MBA), which contain MERFISH imaging and DAPI/PolyT-stained imaging data. After brain image registration and quality control, 50 slides of paired ST and DAPI/PolyT data remain. We then sequentially crop 119808 (ST, DAPI/PolyT) image data pairs out of the MBA, each at a spatial resolution of 512 x 512. To feed the data pairs to the GenAI model efficiently, we store the sparse ST data as PyData/sparse-compatible *.npz files and the bioimage arrays as zarr-compatible *.zip files.

Instrument- or software-specific information needed to interpret the data: PyTorch==2.5.0, PyTorch-Lightning==2.5.0, zarr==2.14.1, scipy, opencv-python, matplotlib, seaborn, pandas, einops, timm, cellpose, sparse, CLIP, pyvips.

Standards and calibration information, if appropriate: N/A

Environmental/experimental conditions: N/A

Describe any quality-assurance procedures performed on the data: N/A

People involved with sample collection, processing, analysis and/or submission: Jiqing Wu, Ingrid Berg, Viktor H. Koelzer

DATA-SPECIFIC INFORMATION FOR: N/A

Number of variables: N/A

Number of cases/rows: N/A

Variable List: N/A

Missing data codes: N/A

Specialized formats or other abbreviations used: N/A