# Modelforge

- [Modelforge](#modelforge)
  - [Installation & Usage](#installation--usage)
  - [Background](#background)
  - [Division of code between Modelhub, Datahub, and Cifutils](#division-of-code-between-modelhub-datahub-and-cifutils)
    - [Cifutils](#cifutils)
    - [Datahub](#datahub)
  - [Training, Validation, and Inference](#training-validation-and-inference)
    - [Training and Validation](#training-and-validation)
    - [Inference](#inference)
  - [Setup](#setup)
    - [Apptainers](#apptainers)
      - [Base Apptainer](#base-apptainer)
      - [Inference Apptainer](#inference-apptainer)
    - [Shebang](#shebang)
      - [General Use](#general-use)
      - [Debugging](#debugging)

## Installation & Usage

Follow these steps to set up **Modelforge** and run a test prediction.

---

### 1. Install the repository using `uv`
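
```bash
git clone https://github.com/RosettaCommons/modelforge.git \
  && cd modelforge \
  && uv python install 3.12 \
  && uv venv --python 3.12 \
  && source .venv/bin/activate \
  && uv pip install -e .
```

### 2. Download model weights

```bash
wget http://files.ipd.uw.edu/pub/rf3/rf3_latest.pt
```

### 3. Run a test prediction

```bash
rf3 fold tests/data/5vht_from_json.json
```

Details on the exact formatting of the JSON files are available here:

## Background

This repository constitutes the base for deep-learning method development at the Institute for Protein Design.

It is symbiotic with two other Institute for Protein Design repositories:
- [cifutils](https://github.com/baker-laboratory/cifutils), which manages input parsing and data cleaning
- [datahub](https://github.com/baker-laboratory/datahub), which manages input featurization and holds our composable `Transform` components

Within this ontology, `modelhub` contains the *architectures*, *training* code, and *inference* endpoints.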

## Division of code between Modelhub, Datahub, and Cifutils

Across our codebases, we balance the need to develop quickly with the need to write code that we can continue to maintain and that is easy to understand. Below, we lay out some thoughts on what code should live where.

We enforce a strict dependency flow of `modelhub` -> (depends on) `datahub` -> (depends on) `cifutils`; importing `datahub` or `modelhub` functions from within `cifutils` would thus be a circular anti-pattern.

### Cifutils

[cifutils](https://github.com/baker-laboratory/cifutils) is the most static of our three codebases. Basic parsing functionality, RDKit and other molecular toolkit utilities, and `AtomArray` quality-of-life tools live in this repository.

Examples of `cifutils` functions are:
- All functions related to **parsing structural files from source**; e.g., keeping/removing hydrogens, resolving occupancy, etc.
- Utility functions to manipulate `AtomArrays`, the core API of the `biotite` library, upon which we heavily rely (see the sketch below)
- Utility functions for common bioinformatics software, such as `RDKit`, that interface with `AtomArrays`
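
For illustration, a `cifutils`-style utility might look like the following sketch; `remove_hydrogens` is a hypothetical example written against `biotite`, not necessarily an actual `cifutils` function:

```python
from biotite.structure import AtomArray


def remove_hydrogens(atom_array: AtomArray) -> AtomArray:
    """Return a copy of the structure with all hydrogen atoms removed.

    Hypothetical example of a small, broadly useful AtomArray utility:
    it depends only on biotite, with no knowledge of datahub or modelhub.
    """
    # Boolean-mask indexing on an AtomArray returns a filtered copy
    return atom_array[atom_array.element != "H"]
```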

As a foundational library for the Institute for Protein Design, `cifutils` functions most like an open-source codebase. We must keep the code easy to understand and easy to maintain, both now and into the future. As such, `cifutils`:
- Maintains the **highest code quality standard**, requiring well-documented, easy-to-maintain code with adequate test coverage (we aim for **>85%** coverage)
- **Strictly versions** to minimize breaking changes with downstream repositories

You should write code in `cifutils` if:
- You are writing **core** `AtomArray`-level functionality that will be broadly useful, not only to those at the Institute for Protein Design but possibly the wider bioinformatics community (i.e., without dependencies on, or even knowledge of, `datahub` or `modelhub`)
- You are willing to spend some additional time to ensure the code is **scalable, well-tested, and maintainable**

Quick-and-dirty experiments that require modifying `cifutils` can be performed by submoduling or cloning the repository and exporting a local path, as in the example below.
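
For example, to point Python at a local `cifutils` checkout (the path here is illustrative; it mirrors the `PYTHONPATH` pattern used in the Training section):

```
export PYTHONPATH="/home/<USER>/projects/cifutils/src:$PYTHONPATH"
```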

### Datahub

[datahub](https://github.com/baker-laboratory/datahub) manages data loading, preprocessing, and featurization pipelines for structure-dependent deep-learning models. We offer three core components: a `Transforms` library, a set of `Preprocessing` scripts, and `Datasets`.
- **Transforms**: A series of composable classes that take as input a dictionary containing sequence- and structure-based data (in the form of an `AtomArray`) and perform arbitrary operations, analogous to TorchVision's [approach](https://pytorch.org/vision/main/transforms.html) for computer vision (see the sketch below)
- **Preprocessing**: Scripts and functions for common data cleaning and preparation tasks, including specialized pipelines for frequent use cases (e.g., antibodies, clash detection, cleaning PDB data, etc.). Many of these *scripts* output `parquet` files stored to disk that are sampled from at train time, while the *functions* are called by the scripts to clean, label, or filter the data (e.g., `has_clash()`, etc.)
- **Datasets**: The base `Datasets` and `Sampler` classes used for training, imported by `modelhub`

`datahub` is less static than `cifutils`; however, it must still operate as a stand-alone library that others can continue to build around and upon, even without `modelhub`. We strive to maintain `datahub` like an open-source software project so that others in the lab can easily understand, and build upon, our base components. We focus on **maintainable** and **flexible** code; if a particular `Transform` is bespoke or non-generalizable (at least initially), then the `/projects` folder within `modelhub` may be a more appropriate place for initial development.
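
As a rough illustration of the composable-`Transform` pattern, the sketch below adds a per-atom SASA annotation; the dict keys and the absence of a formal base class here are assumptions made for illustration, not `datahub`'s actual API:

```python
from biotite.structure import AtomArray, sasa


class AnnotateSasa:
    """Hypothetical Transform that annotates solvent accessibility.

    Takes the dictionary of sequence- and structure-based data described
    above, computes per-atom SASA with biotite, and passes the dict on.
    """

    def __call__(self, data: dict) -> dict:
        atom_array: AtomArray = data["atom_array"]  # assumed key
        data["sasa"] = sasa(atom_array)  # per-atom surface area (ndarray)
        return data
```

Because each `Transform` maps a dict to a dict, transforms chain naturally, which is what makes the TorchVision-style composition work.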

You should write code in `datahub` if:
- You are writing flexible, generic *pre-processing scripts* or *functions* that others in the lab have expressed interest in using (vs. a single-purpose pipeline or feature to test a hypothesis)
  - **Example that should live in `datahub`**: You are writing a pre-processing pipeline to label all beta barrels in the PDB. Your scripts, written in a functional manner, may be a good candidate for `datahub/scripts/preprocessing`, so long as you are willing to write them generally and include tests. Similarly, if a single function may be generalizable but the pipeline is bespoke, that single function (with a test) could still be included as a stand-alone element in `datahub`, e.g.,
    ```python
    def atom_array_has_beta_barrel(atom_array: AtomArray) -> bool: ...
    ```
  - **Example that should live in `modelhub/projects`**: You have pulled together a script that loads PDB files, includes manual annotations, and saves out to CIF. Such a script may be appropriate for the specific use case but is unlikely to generalize across other use cases.
- You are writing `Transforms` that generalize to additional use cases beyond the current project
  - **Example that should live in `datahub`**: Any `Transform` that adds a useful annotation to an `AtomArray` (e.g., annotating pocket residues, hydrogen bonds, SASA, etc.)
  - **Example that should live in `datahub`**: A `Transform` that pads DNA with generated B-form structure, as is done in AF-3; such a `Transform` may be applicable to both structure prediction and design, when proven effective
  - **Example that should live in `modelhub/projects`**: A `Transform` that aggregates and/or concatenates features for a bespoke model pipeline
- You are willing to spend some additional time to ensure the code is scalable, well-tested, and maintainable. Otherwise, the `projects` folder of `modelhub` may be a more appropriate place in the interim

## Training, Validation, and Inference

> If you are developing at the IPD, our `shebang` executables will take care of identifying and executing with the most up-to-date apptainer. If you are not at the IPD, you will need to ensure you have the appropriate apptainer. See below for details.

NOTE: For Training, Validation, and Inference, we make heavy use of [Hydra](https://hydra.cc/) for configuration management.

Before running any of the below commands, you will need to ensure `datahub` and `cifutils` are in your `PYTHONPATH`, e.g.:
```
export PYTHONPATH="/home/<USER>/projects/datahub/src:/home/<USER>/projects/cifutils/src"
```

### Training and Validation

For Training and Validation, when you execute `train.py` or `validate.py`, you will need to provide an *experiment* Hydra config. Experiments are a Hydra best-practice pattern that enables us to maintain multiple configurations; see more in the [Hydra documentation](https://hydra.cc/docs/patterns/configuring_experiments/) and in the `configs/experiment` sub-directory.

For example, to test AF-3 training without confidence, run:
```
./src/modelhub/train.py experiment=quick-af3 debug=default
```
|
**Explanation:**
- `./src/modelhub/train.py` — we execute our `train.py` like a bash executable, which triggers the `shebang` code to find the correct apptainer. It's equivalent to `apptainer exec --nv /path/to/apptainer python ./src/modelhub/train.py`
- `experiment=quick-af3` — we identify the experiment we want to use for training; in this case, `quick-af3`, which can be viewed at `configs/experiment/quick-af3.yaml`. This experiment is a simple test config for AF-3 that loads and runs more rapidly than the full training config
- `debug=default` — a setting letting Hydra know we are debugging; when we debug, we apply some automatic time-savers, such as setting a small diffusion batch size and crop size. You can drop this option if you don't want those settings. You can explore the various `debug` options in `configs/debug`
|
For validation only, run the following:
```
./src/modelhub/validate.py experiment=quick-af3 debug=default
```
|
Note that since we use `hydra`, you can specify additional setup arguments on the command line. For example, by default we `prevalidate`, running validation at the beginning of training so that we develop a baseline and catch any errors (especially out-of-memory errors) before training for a full epoch. If you don't want that behavior, you can override it in-line:
```
./src/modelhub/train.py experiment=quick-af3 debug=default trainer.prevalidate=false
```
|
You can view the flattened Hydra configuration, to determine how best to override or add arguments, by:
- Running training or validation and viewing the pretty-printed config file it produces
- Adding `--cfg job` to your launch command, which prints the config for the application and then exits, as in the example below
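
For example, to print the composed config for the `quick-af3` experiment and exit without training:

```
./src/modelhub/train.py experiment=quick-af3 --cfg job
```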
|
### Inference

To support multiple models and multiple projects, we build an `InferenceEngine` for each use case. For end-users, the details of the `InferenceEngine` are not necessary; the appropriate engine can be specified with the `inference_engine` argument.

For example, to run the latest AF-3 model with confidence, we can execute (if `cifutils` and `datahub` are in the `PYTHONPATH`):
```
./src/modelhub/inference.py inference_engine=af3 inputs='./tests/data/example_with_ncaa.json'
```
|
We can then modify the command by adding or removing arguments with Hydra to our liking; for example, to dump diffusion trajectories and only include one model per CIF file:
```
./src/modelhub/inference.py inference_engine=af3 inputs='./tests/data/example_with_ncaa.json' dump_trajectories=true one_model_per_file=true
```

More details can be found in the [inference README](src/modelhub/inference_engines/README.md).
|
## Setup

> If you are developing at the IPD, then our `shebang` executables will handle the Apptainer dependencies; there is no need to run the commands below. See the `shebang` section below.

### Apptainers

To accelerate development and better contain dependencies, we offer two apptainers:
- `base_apptainer`: Contains all of the development dependencies, but not a static `modelhub` (with corresponding submodules of `cifutils` and `datahub`)
- `inference_apptainer`: Takes the `base_apptainer` as its image, and pip-installs `modelhub` as well (useful for releasing self-contained inference code)

The rationale for these apptainers is to provide designers with a stable environment in which to tackle design problems. Either image can also be used to run entry points by hand, as in the sketch below.
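
For reference, executing an entry point inside an apptainer manually (the image path is a placeholder for wherever your apptainer lives) is equivalent to what the `shebang` machinery does for you:

```
apptainer exec --nv /path/to/apptainer python ./src/modelhub/train.py experiment=quick-af3 debug=default
```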
|
#### Base Apptainer

To make the base apptainer, run:
```
make base_apptainer
```
from the project root.

> NOTE: You will need to adjust the IPD-specific paths to frozen copies of the PDB and the CCD
|
#### Inference Apptainer

To make a container that contains `cifutils`, `datahub`, and `modelhub`, run:
```
make inference_apptainer
```
This will use the `base_apptainer` pointed to by the `shebang` symlink as a base.
|
### Shebang

#### General Use

We use `shebang` to help manage and version apptainers. Namely:
- The shebang lines (`#!/bin/bash` ...) at the top of entry-point scripts like `train.py` redirect the system to `scripts/shebang/modelhub_exec.sh`
- The script `modelhub_exec.sh` in turn identifies the correct Apptainer and executes your command
- Apptainers are symlinks in `scripts/shebang` to elsewhere on the DIGS (where they are versioned); thus, when we update apptainers, we must also update the symlink. This allows us to track which apptainers to use for a given branch of the code at any given time (provided you update the symlinks for your branch when you switch out which apptainer you run with!)
|
For example, to launch a dummy training run, one could type (after adding `cifutils` and `datahub` to your `PYTHONPATH`):
```
cd src/modelhub
./train.py experiment=none-00-dummy
```
> You may need to adjust the permissions on `train.py` (e.g., `chmod +x train.py`) in order to execute the file like a script.
|
#### Debugging

We also support VSCode-native debugging with Apptainers. To debug:
1. Update your `launch.json` to include `Python: Attach`; for example, add the configuration:
   ```
   {
       "name": "Python: Attach",
       "type": "debugpy",
       "request": "attach",
       "connect": {
           "host": "localhost",
           "port": 2345
       }
   }
   ```
2. Add any interactive debug breakpoints in VSCode
3. Set the `DEBUG_PORT` to `2345`, and then execute your script with `shebang` like normal. That is:
   ```
   export DEBUG_PORT=2345
   ./train.py experiment=none-00-dummy
   ```
4. When prompted in the terminal, launch the VSCode debug session (shortcut: `F5`)
|
Happy debugging!