WIP: data in the cloud module #44
base: main
Conversation
I like the idea of having dependencies specified for each module, but logistically I think it makes the most sense to toss these into one big conda environment that users will use for the whole workshop.
I think this makes sense for the workshop. I'd also like to provide the pixi files as a separate option, so that people following along with the entire course can use the Docker file, while anyone who wants to try out just this module can use the pixi environment.
I think that makes sense. Perhaps we should all do this for our modules. Any thoughts on keeping all these environment files up-to-date?
I'll plan to run the tutorial using either option to make sure everything runs correctly. I don't think we should try to make the environments identical, since dependency solving will differ between the smaller per-module environment and the larger workshop-wide one.
A couple notes as I'm reading:
I did some tweaks that I felt confident were uncontroversial, and I'm going to merge what we have!
The prompt "List the files available following at this prefix on AWS S3 storage" left me expecting the list of files to be output. If we don't intend to output it, we could instead say "Create a list of the files..."?
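For reference, a minimal sketch of what that cell might look like with s3fs; the bucket and prefix below are placeholders rather than the ones used in the notebook, and printing the result explicitly would make the listing show up in the output:

```python
import s3fs

# Anonymous access to a public bucket; the bucket and prefix are
# placeholders, not the ones used in the notebook.
fs = s3fs.S3FileSystem(anon=True)

# List the objects under the prefix and print them so the result
# appears in the executed notebook output.
files = fs.ls("example-bucket/example-prefix/")
print(files)
```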
Do you want this notebook to show up as executed in the course materials website? Or do you want the participants to execute it themselves to see the results?
I apologize, I didn't read this carefully :D I'll make it a draft PR.
| "There are a couple implications that you should be aware of when working with data on the cloud:\n", | ||
| "\n", | ||
| "- Pay-as-you-go - Most cloud providers use pay-as-you-go pricing, where you only pay for the storage and services that you use. This can potentially reduce costs, especially upfront costs (e.g., you never need to buy a hard drive). However, **it can be easy to forget about data in storage and continue to pay for it indefinitely**.\n", | ||
| "- Time and cost of bringing data to your computer - Hosting the data on the cloud naturally means it's no longer already near your computer's processing resources. Transporting data from the cloud to your computer is expensive, since most cloud providers charge for any data leaving their network, and slow, since the data needs to travel large distances. The primary solution for this is \"data-proximate computing\" which involves running your code on computing resources in the same cloud location as your data. For example, I commonly use NASA data products that are hosted on AWS servers in the 'us-west-2' region, which corresponds to Oregon in the figure above. Following the \"data-proximate computing\" paradigm, I use AWS compute resources that are also in Oregon when working with those data, rather than downloading data to use the computing resources on my laptop in North Carolina. In addition to \"data-proximate computing\", there are many other ways to make working with data on the cloud cheaper and easier. Let's take a look!" |
Minor but I might create a new paragraph that starts with:
The primary solution for the second bullet, "time and cost bringing data to your computer", is "data-proximate computing"...
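For illustration, here is a minimal sketch of that data-proximate / streaming access pattern using fsspec and xarray. The Zarr store path is a placeholder and the keyword arguments are assumptions that depend on the actual dataset:

```python
import fsspec
import xarray as xr

# Placeholder Zarr store on S3, assuming anonymous (public) access.
mapper = fsspec.get_mapper("s3://example-bucket/example-store.zarr", anon=True)

# Open lazily: only metadata is read up front, and data chunks are
# fetched from object storage on demand when values are computed.
ds = xr.open_zarr(mapper, consolidated=True)
print(ds)
```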
| "\n", | ||
| "- [Cloud-Optimized Geospatial Formats Guide](https://guide.cloudnativegeo.org/)\n", | ||
| "- [Xarray Tutorial - Zarr in Cloud Object Storage](https://tutorial.xarray.dev/intermediate/remote_data/cmip6-cloud.html)\n", | ||
| "- [Xarray Tutorial - Access Patterns to Remote Data with fsspec](https://tutorial.xarray.dev/intermediate/remote_data/cmip6-cloud.html)\n" |
No pressure to do this, but you could also link to https://icesat-2-2024.hackweek.io/tutorials/cloud-computing/00-goals-and-outline.html which has very similar content.
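Related to the fsspec tutorial linked above, a minimal sketch of streaming a single remote NetCDF file without downloading it first; the URL is a placeholder and the engine choice is an assumption that depends on the actual file format:

```python
import fsspec
import xarray as xr

# Placeholder object URL, assuming anonymous access and an HDF5-based NetCDF file.
url = "s3://example-bucket/example-file.nc"

# fsspec.open returns a file-like object that xarray can read directly,
# so no local copy is written to disk. Work with the dataset inside the
# context manager while the remote file handle is still open.
with fsspec.open(url, mode="rb", anon=True) as f:
    ds = xr.open_dataset(f, engine="h5netcdf")
    print(ds)
```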
@maxrjones This is awesome! I didn't run the code but let me know if you need me to do that as well.
Co-authored-by: Aimee Barciauskas <[email protected]>
Not quite ready for a review yet, will resume work on this ~Dec 1
🔍 Preview: https://geojupyter-workshop-open-source-geospatial--44.org.readthedocs.build/
Note: This Pull Request preview is provided by ReadTheDocs. Our production website, however, is currently deployed with GitHub Pages.