WIP: data in the cloud module #44
base: main
Conversation
I like the idea of having dependencies specified for each module, but logistically I think it makes the most sense to toss these into one big conda environment that users will use for the whole workshop.
I think this makes sense for the workshop. I'd also like to provide the pixi files as a separate option, so that people following along with the entire course can use the Docker file, while anyone who wants to try out just this module can use the pixi environment.
I think that makes sense. Perhaps we should all do this for our modules. Any thoughts on keeping all these environment files up-to-date?
I'll plan to run the tutorial using either option to make sure everything runs correctly. I don't think we should try to make the environments identical, since dependency solving will differ between the smaller per-module environment and the larger workshop-wide one.
A couple notes as I'm reading:
I did some tweaks that I felt confident were uncontroversial, and I'm going to merge what we have!
The prompt "List the files available following at this prefix on AWS S3 storage" left me expecting the list of files to be output. If we don't intend to output it, we could instead say "Create a list of the files..."?
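For reference, a minimal sketch of what that cell might look like with s3fs; the bucket and prefix below are placeholders rather than the ones used in the notebook, and printing the result explicitly would make the listing show up in the output:

```python
import s3fs

# Anonymous access to a public bucket; the bucket and prefix are
# placeholders, not the ones used in the notebook.
fs = s3fs.S3FileSystem(anon=True)

# List the objects under the prefix and print them so the result
# appears in the executed notebook output.
files = fs.ls("example-bucket/example-prefix/")
print(files)
```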
Do you want this notebook to show up as executed in the course materials website? Or do you want the participants to execute it themselves to see the results?
I apologize, I didn't read this carefully :D I'll make it a draft PR.
| "There are a couple implications that you should be aware of when working with data on the cloud:\n", | ||
| "\n", | ||
| "- Pay-as-you-go - Most cloud providers use pay-as-you-go pricing, where you only pay for the storage and services that you use. This can potentially reduce costs, especially upfront costs (e.g., you never need to buy a hard drive). However, **it can be easy to forget about data in storage and continue to pay for it indefinitely**.\n", | ||
| "- Time and cost of bringing data to your computer - Hosting the data on the cloud naturally means it's no longer already near your computer's processing resources. Transporting data from the cloud to your computer is expensive, since most cloud providers charge for any data leaving their network, and slow, since the data needs to travel large distances. The primary solution for this is \"data-proximate computing\" which involves running your code on computing resources in the same cloud location as your data. For example, I commonly use NASA data products that are hosted on AWS servers in the 'us-west-2' region, which corresponds to Oregon in the figure above. Following the \"data-proximate computing\" paradigm, I use AWS compute resources that are also in Oregon when working with those data, rather than downloading data to use the computing resources on my laptop in North Carolina. In addition to \"data-proximate computing\", there are many other ways to make working with data on the cloud cheaper and easier. Let's take a look!" |
Minor but I might create a new paragraph that starts with:
The primary solution for the second bullet, "time and cost bringing data to your computer", is "data-proximate computing"...
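For illustration, here is a minimal sketch of that data-proximate / streaming access pattern using fsspec and xarray. The Zarr store path is a placeholder and the keyword arguments are assumptions that depend on the actual dataset:

```python
import fsspec
import xarray as xr

# Placeholder Zarr store on S3, assuming anonymous (public) access.
mapper = fsspec.get_mapper("s3://example-bucket/example-store.zarr", anon=True)

# Open lazily: only metadata is read up front, and data chunks are
# fetched from object storage on demand when values are computed.
ds = xr.open_zarr(mapper, consolidated=True)
print(ds)
```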
| "\n", | ||
| "- [Cloud-Optimized Geospatial Formats Guide](https://guide.cloudnativegeo.org/)\n", | ||
| "- [Xarray Tutorial - Zarr in Cloud Object Storage](https://tutorial.xarray.dev/intermediate/remote_data/cmip6-cloud.html)\n", | ||
| "- [Xarray Tutorial - Access Patterns to Remote Data with fsspec](https://tutorial.xarray.dev/intermediate/remote_data/cmip6-cloud.html)\n" |
No pressure to do this, but you could also link to https://icesat-2-2024.hackweek.io/tutorials/cloud-computing/00-goals-and-outline.html which has very similar content.
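Related to the fsspec tutorial linked above, a minimal sketch of streaming a single remote NetCDF file without downloading it first; the URL is a placeholder and the engine choice is an assumption that depends on the actual file format:

```python
import fsspec
import xarray as xr

# Placeholder object URL, assuming anonymous access and an HDF5-based NetCDF file.
url = "s3://example-bucket/example-file.nc"

# fsspec.open returns a file-like object that xarray can read directly,
# so no local copy is written to disk. Work with the dataset inside the
# context manager while the remote file handle is still open.
with fsspec.open(url, mode="rb", anon=True) as f:
    ds = xr.open_dataset(f, engine="h5netcdf")
    print(ds)
```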
@maxrjones This is awesome! I didn't run the code but let me know if you need me to do that as well.
Co-authored-by: Aimee Barciauskas <[email protected]>
Not quite ready for a review yet, will resume work on this ~Dec 1
🔍 Preview: https://geojupyter-workshop-open-source-geospatial--44.org.readthedocs.build/
Note: This Pull Request preview is provided by ReadTheDocs. Our production website, however, is currently deployed with GitHub Pages.