Commit 0e41531

Rohith Krishna committed: reinit README

1 parent a0ebd6b

File tree

1 file changed: +19, -189 lines


README.md

Lines changed: 19 additions & 189 deletions
@@ -1,200 +1,30 @@
-# Modelhub
+# Modelforge

-- [Modelhub](#modelhub)
-  - [Background](#background)
-  - [Division of code between Modelhub, Datahub, and Cifutils](#division-of-code-between-modelhub-datahub-and-cifutils)
-    - [Cifutils](#cifutils)
-    - [Datahub](#datahub)
-  - [Training, Validation, and Inference](#training-validation-and-inference)
-    - [Training and Validation](#training-and-validation)
-    - [Inference](#inference)
-  - [Setup](#setup)
-    - [Apptainers](#apptainers)
-      - [Base Apptainer](#base-apptainer)
-      - [Inference Apptainer](#inference-apptainer)
-    - [Shebang](#shebang)
-      - [General Use](#general-use)
-      - [Debugging](#debugging)
+## Installation & Usage

-## Background
+Follow these steps to set up **ModelForge** and run a test prediction.

-This repository constitutes the base for deep-learning method development at the Institute for Protein Design.
+---

-It is symbiotic with two other Institute for Protein Design repositories:
-- [cifutils](https://github.com/baker-laboratory/cifutils), which manages input parsing and data cleaning
-- [datahub](https://github.com/baker-laboratory/datahub), which manages input featurization and holds our composable `Transform` components
+### 1. Install the repository using `uv`

-Within this ontology, `modelhub` contains the *architectures*, *training* code, and *inference* endpoints.
-
-## Division of code between Modelhub, Datahub, and Cifutils
-
-Across our codebases, we balance the need to develop quickly with the need to write code that we can continue to maintain and that is easy to understand. Below, we lay out some thoughts on what code should live where.
-
-We enforce a strict dependency flow of `modelhub` -> (depends on) `datahub` -> (depends on) `cifutils`; importing `datahub` or `modelhub` functions from within `cifutils` would thus be a circular anti-pattern.
-
-### Cifutils
-
-[cifutils](https://github.com/baker-laboratory/cifutils) is the most static of our three codebases. Basic parsing functionality, RDKit and other molecular-toolkit utilities, and `AtomArray` quality-of-life tools live in this repository.
-
-Examples of `cifutils` functions are (a sketch of one such utility follows this list):
-- All functions related to **parsing structural files from source**; e.g., keeping/removing hydrogens, resolving occupancy, etc.
-- Utility functions to manipulate `AtomArrays`, the core API of the `biotite` library, upon which we heavily rely
-- Utility functions for common bioinformatics software, such as `RDKit`, that interface with `AtomArrays`
-
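To ground the `AtomArray` utilities described in the list above, here is a minimal sketch of what such a helper can look like, using only public `biotite` APIs; the function name and signature are illustrative, not taken from `cifutils`:

```python
import biotite.structure as struc

def remove_hydrogens(atom_array: struc.AtomArray) -> struc.AtomArray:
    """Return the array with all hydrogen atoms dropped."""
    # AtomArrays support boolean-mask indexing over per-atom annotations
    return atom_array[atom_array.element != "H"]
```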
-As a foundational library for the Institute for Protein Design, `cifutils` functions most like an open-source codebase. We must keep the code easy to understand and easy to maintain, both now and into the future. As such, `cifutils`:
-- Maintains the **highest code quality standard**, requiring well-documented, easy-to-maintain code with adequate test coverage (we aim for **>85%** coverage)
-- **Strictly versions** to minimize breaking changes with downstream repositories
-
-You should write code in `cifutils` if:
-- You are writing **core** `AtomArray`-level functionality that will be broadly useful, not only to those at the Institute for Protein Design but possibly the wider bioinformatics community (i.e., without dependencies on, or even knowledge of, `datahub` or `modelhub`)
-- You are willing to spend some additional time to ensure the code is **scalable, well-tested, and maintainable**
-
-Quick-and-dirty experiments that require modifying `cifutils` can be performed by submoduling or cloning the repository and exporting a local path.
-
-### Datahub
-
-[datahub](https://github.com/baker-laboratory/datahub) manages data loading, preprocessing, and featurization pipelines for structure-dependent deep-learning models. We offer three core components: a `Transforms` library, a set of `Preprocessing` scripts, and `Datasets`.
-- **Transforms**: A series of composable classes that take as input a dictionary containing sequence- and structure-based data (in the form of an `AtomArray`) and perform arbitrary operations, analogous to TorchVision's [approach](https://pytorch.org/vision/main/transforms.html) for computer vision (a sketch follows this list)
-- **Preprocessing**: Scripts and functions for common data cleaning and preparation tasks, including specialized pipelines for frequent use cases (e.g., antibodies, clash detection, cleaning PDB data, etc.). Many of these *scripts* output `parquet` files stored to disk that are sampled from at train-time, while the *functions* are called by the scripts to clean, label, or filter the data (e.g., `has_clash()`, etc.)
-- **Datasets**: The base `Datasets` and `Sampler` classes used for training, imported by `modelhub`
-
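To make the `Transform` pattern above concrete, here is a minimal sketch of a composable transform, assuming a plain-`dict` interface and a hypothetical class name and dictionary keys (the actual `datahub` base class may differ):

```python
from biotite.structure import AtomArray

class AnnotateChainCount:
    """Hypothetical Transform: records how many chains the structure has."""

    def __call__(self, data: dict) -> dict:
        atom_array: AtomArray = data["atom_array"]
        # `chain_id` is a per-atom annotation on the AtomArray
        data["num_chains"] = len(set(atom_array.chain_id))
        return data
```

Because each transform is just a callable from dict to dict, pipelines can be built by simple composition, mirroring the TorchVision pattern referenced above.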
-`datahub` is less static than `cifutils`; however, it still must operate as a stand-alone library that others can continue to build around and upon, even without `modelhub`. We strive to maintain `datahub` like an open-source software project such that others in the lab can easily understand, and build upon, our base components. We focus on **maintainable** and **flexible** code; if a particular `Transform` is bespoke or non-generalizable (at least initially), then the `/projects` folder within `Modelhub` may be a more appropriate place for initial development.
-
-You should write code in `datahub` if:
-- You are writing flexible, generic *pre-processing scripts* or *functions* that others in the lab have expressed interest in using (vs. a single-purpose pipeline or feature to test a hypothesis)
-  - **Example that should live in `datahub`**: You are writing a pre-processing pipeline to label all beta barrels in the PDB. Your scripts, written in a functional manner, may be a good candidate for `datahub/scripts/preprocessing`, so long as you are willing to write them generally and include tests. Similarly, if a single function may be generalizable but the pipeline is bespoke, that single function (with a test) could still be included as a stand-alone element in `datahub`, e.g.,
-    ```python
-    def atom_array_has_beta_barrel(atom_array: AtomArray) -> bool: ...
-    ```
-  - **Example that should live in `modelhub/projects`**: You have pulled together a script that loads PDB files, includes manual annotations, and saves out to CIF. Such a script may be appropriate for the specific use case but is unlikely to generalize across other use cases.
-- You are writing `Transforms` that generalize to additional use cases beyond the current project
-  - **Example that should live in `datahub`**: Any `Transform` that adds a useful annotation to an `AtomArray` (e.g., annotating pocket residues, hydrogen bonds, SASA, etc.)
-  - **Example that should live in `datahub`**: A `Transform` that pads DNA with generated B-form structure, as is done in AF-3; such a `Transform` may be applicable to both structure prediction and design, when proven effective
-  - **Example that should live in `modelhub/projects`**: A `Transform` that aggregates and/or concatenates features for a bespoke model pipeline
-- You are willing to spend some additional time to ensure the code is scalable, well-tested, and maintainable. Otherwise the `projects` folder of `modelhub` may be a more appropriate place in the interim
-
-## Training, Validation, and Inference
-
-> If you are developing at the IPD, our `shebang` executables will take care of identifying and executing with the most up-to-date apptainer. If you are not at the IPD, you will need to ensure you have the appropriate apptainer. See below for details.
-
-NOTE: For Training, Validation, and Inference, we make heavy use of [Hydra](https://hydra.cc/) for configuration management.
-
-Before running any of the below commands, you will need to ensure `datahub` and `cifutils` are in your `PYTHONPATH`. E.g.,
-```
-export PYTHONPATH="/home/<USER>/projects/datahub/src:/home/<USER>/projects/cifutils/src"
-```
-
-### Training and Validation
-
-For Training and Validation, when you execute `train.py` or `validate.py`, you will need to provide an *experiment* Hydra config. Experiments are a Hydra best-practice pattern that enables us to maintain multiple configurations; see more in the [Hydra documentation](https://hydra.cc/docs/patterns/configuring_experiments/)
-and in the `configs/experiment` sub-directory (a hypothetical example is sketched below).
-
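For intuition, the training command below composes the final config from the experiment file at runtime; the same composition can be reproduced through Hydra's Python API. A minimal sketch, assuming a `configs` directory and a top-level `train` config (both assumptions about the repository layout):

```python
from hydra import compose, initialize

# Hypothetical config_path/config_name; the repository's actual layout may differ
with initialize(version_base=None, config_path="configs"):
    cfg = compose(
        config_name="train",
        overrides=["experiment=quick-af3", "debug=default"],
    )
print(cfg)
```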
-For example, to test AF-3 training without confidence, run:
+```bash
+git clone https://github.com/RosettaCommons/modelforge.git \
+  && cd modelforge \
+  && uv python install 3.12 \
+  && uv venv --python 3.12 \
+  && source .venv/bin/activate \
+  && uv pip install -e .
 ```
-./src/modelhub/train.py experiment=quick-af3 debug=default
-```
-
-**Explanation:**
-- `./src/modelhub/train.py` — we execute our `train.py` like a bash executable, which triggers the `shebang` code to find the correct apptainer. It's equivalent to `apptainer exec --nv /path/to/apptainer python ./src/modelhub/train.py`
-- `experiment=quick-af3` — we identify the experiment we want to use for training; in this case, `quick-af3`, which can be viewed at `configs/experiment/quick-af3.yaml`. This experiment is a simple test config for AF-3 that loads and runs more rapidly than the full training config
-- `debug=default` — a setting letting Hydra know we are debugging; when we debug, we apply some automatic time-savers, like setting a small diffusion batch size and crop size. You can remove this option if you don't want those defaults, and you can explore the various `debug` options in `config/debug`

-For validation only, run the following:
-```
-./src/modelhub/validate.py experiment=quick-af3 debug=default
+### 2. Download model weights
+```bash
+wget http://files.ipd.uw.edu/pub/rf3/rf3_latest.pt
 ```

-Note that since we use `hydra`, you can specify additional setup arguments on the command line. For example, by default, we `prevalidate`: running validation at the beginning of training so we develop a baseline and catch any errors (especially out-of-memory errors) before training for a full epoch. If you don't want that behavior, you can override in-line:
-```
-./src/modelhub/train.py experiment=quick-af3 debug=default trainer.prevalidate=false
+### 3. Run a test prediction
+```bash
+rf3 fold tests/data/5vht_from_json.json
 ```

-You can view the flattened Hydra configuration to determine how best to override or add additional arguments by:
-- Running training or validation and viewing the pretty-printed config, which looks like:
-  ![Example pretty-printed Hydra config](assets/example_config.png)
-- Adding `--cfg job` to your launch command, which prints the config for the application and then exits
-
-### Inference
-
-To support multiple models and multiple projects, we build an `InferenceEngine` for each use case. For end-users, the details of the `InferenceEngine` are not necessary; the appropriate engine can be specified with the `inference_engine` argument.
-
-For example, to run the latest AF-3 model with confidence, we can execute (if `cifutils` and `datahub` are in the `PYTHONPATH`):
-```
-./src/modelhub/inference.py inference_engine=af3 inputs='./tests/data/example_with_ncaa.json'
-```
-
-We can then modify the command by adding/removing arguments with Hydra to our liking; for example, to dump diffusion trajectories and only include one model per CIF file:
-```
-./src/modelhub/inference.py inference_engine=af3 inputs='./tests/data/example_with_ncaa.json' dump_trajectories=true one_model_per_file=true
-```
-
-More details can be found in the [inference README](src/modelhub/inference_engines/README.md)
-
-## Setup
-
-> If you are developing at the IPD, then our `shebang` executables will handle the Apptainer dependencies; no need to run the commands below. See the `shebang` section below.
-
-### Apptainers
-To accelerate development and better contain dependencies, we offer two apptainers:
-- `base_apptainer`: Contains all of the development dependencies, but not a static `modelhub` (with corresponding submodules of `cifutils` and `datahub`)
-- `inference_apptainer`: Takes the `base_apptainer` as its image, and pip-installs `modelhub` as well (useful for releasing self-contained inference code). The rationale for these apptainers is to provide designers with a stable environment in which to tackle design problems.
-
-#### Base Apptainer
-
-To make the base apptainer, run:
-```
-make base_apptainer
-```
-from the project root.
-
-> NOTE: You will need to adjust the IPD-specific paths to frozen copies of the PDB and the CCD
-
-#### Inference Apptainer
-
-To make a container that contains `cifutils`, `datahub`, and `modelhub`, run:
-```
-make inference_apptainer
-```
-This will use the `base_apptainer` pointed to by the `shebang` symlink as a base.
-
-### Shebang
-
-#### General Use
-We use `shebang` to help manage and version apptainers (a sketch of the pattern follows this list). Namely:
-- The shebang lines (`#!/bin/bash` ...) at the top of entry-point scripts like `train.py` redirect the system to `scripts/shebang/modelhub_exec.sh`
-- The script `modelhub_exec.sh` in turn identifies the correct Apptainer and executes your command
-- Apptainers are symlinks in `scripts/shebang` to elsewhere on the DIGS (where they are versioned); thus, when we update apptainers, we must also update the symlink. This allows us to track which apptainers to use for a given branch of the code at any given time (provided you update the symlinks for your branch when you switch out which apptainer you run with!)
-
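As an illustration of the first bullet, a common way to make a Python entry point dispatch through a bash script is a bash/Python polyglot header. The following is a hedged sketch of that pattern, not necessarily the exact header used in `modelhub`:

```python
#!/bin/bash
# When executed as a bash script, the line below exec's into the dispatcher,
# which resolves the apptainer symlink and re-runs this file under python.
# When this file is then run as Python, the same line is parsed as adjacent
# string literals: a harmless no-op expression.
"exec" "scripts/shebang/modelhub_exec.sh" "$0" "$@"

# ...normal Python code (imports, Hydra entry point, etc.) continues below...
```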
-For example, to launch a dummy training run, one could type (after adding `cifutils` and `datahub` to your `PYTHONPATH`):
-```
-cd src/modelhub
-./train.py experiment=none-00-dummy
-```
-> You may need to adjust the permissions on `train.py` (e.g., `chmod +x train.py`) in order to execute the file like a script.
-
-#### Debugging
-We also support VSCode-native debugging with Apptainers. To debug:
-1. Update your `launch.json` to include `Python: Attach`; for example, add the configuration:
-   ```
-   {
-     "name": "Python: Attach",
-     "type": "debugpy",
-     "request": "attach",
-     "connect": {
-       "host": "localhost",
-       "port": 2345
-     }
-   }
-   ```
-2. Add any interactive debug breakpoints in VSCode
-3. Set `DEBUG_PORT` to `2345`, and then execute your script with `shebang` as normal (a sketch of the underlying mechanism follows these steps). That is:
-   ```
-   export DEBUG_PORT=2345
-   ./train.py experiment=none-00-dummy
-   ```
-4. When prompted in the terminal, launch the VSCode debug session (shortcut: `F5`)
-
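The attach flow in step 3 presumably relies on a `debugpy` listener inside the apptainer. As a sketch of how an entry point can honor `DEBUG_PORT` (the `debugpy` calls are real APIs; where `modelhub` actually places this logic is an assumption):

```python
import os

import debugpy  # assumed to be installed in the apptainer

# If DEBUG_PORT is set, open a listener and block until VSCode attaches
port = os.environ.get("DEBUG_PORT")
if port:
    debugpy.listen(("localhost", int(port)))
    print(f"Waiting for a debugger to attach on port {port}...")
    debugpy.wait_for_client()
```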
-Happy debugging!
-
-
-
+Details on the exact formatting of the json files are available here: