Skip to content

Commit 5e21ae9

Browse files
committed
Some architecture notes
1 parent 4c10c31 commit 5e21ae9

File tree

1 file changed

+96
-1
lines changed

1 file changed

+96
-1
lines changed

docs/ARCHITECTURE.md

Lines changed: 96 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1 +1,96 @@
1-
# Design and open questions about libsyntax.
1+
# Design and open questions about libsyntax
2+
3+
4+
The high-level description of the architecture is in RFC.md. You might
5+
also want to dig through https://github.com/matklad/fall/ which
6+
contains some pretty interesting stuff build using similar ideas
7+
(warning: it is completely undocumented, poorly written and in general
8+
not the thing which I recommend to study (yes, this is
9+
self-contradictory)).
10+
11+
## Tree
12+
13+
The centerpiece of this whole endeavor is the syntax tree, in the
14+
`tree` module. Open questions:
15+
16+
- how to best represent errors, to take advantage of the fact that
17+
they are rare, but to enable fully-persistent style structure
18+
sharing between tree nodes?
19+
20+
- should we make red/green split from Roslyn more pronounced?
21+
22+
- one can layout nodes in a single array in such a way that children
23+
of the node form a continuous slice. Seems nifty, but do we need it?
24+
25+
- should we use SoA or AoS for NodeData?
26+
27+
- should we split leaf nodes and internal nodes into separate arrays?
28+
Can we use it to save some bits here and there? (leaves don't need
29+
first_child field, for example).
30+
31+
32+
## Parser
33+
34+
The syntax tree is produced using a three-staged process.
35+
36+
First, a raw text is split into tokens with a lexer. Lexer has a
37+
peculiar signature: it is an `Fn(&str) -> Token`, where token is a
38+
pair of `SyntaxKind` (you should have read the `tree` module and RFC
39+
by this time! :)) and a len. That is, lexer chomps only the first
40+
token of the input. This forces the lexer to be stateless, and makes
41+
it possible to implement incremental relexing easily.
42+
43+
Then, the bulk of work, the parser turns a stream of tokens into
44+
stream of events. Not that parser **does not** construct a tree right
45+
away. This is done for several reasons:
46+
47+
* to decouple the actual tree data structure from the parser: you can
48+
build any datastructre you want from the stream of events
49+
50+
* to make parsing fast: you can produce a list of events without
51+
allocations
52+
53+
* to make it easy to tweak tree structure. Consider this code:
54+
55+
```
56+
#[cfg(test)]
57+
pub fn foo() {}
58+
```
59+
60+
Here, the attribute and the `pub` keyword must be the children of
61+
the `fn` node. However, when parsing them, we don't yet know if
62+
there would be a function ahead: it very well might be a `struct`
63+
there. If we use events, we generally don't care about this *in
64+
parser* and just spit them in order.
65+
66+
* (Is this true?) to make incremental reparsing easier: you can reuse
67+
the same rope data structure for all of the original string, the
68+
tokens and the events.
69+
70+
71+
The parser also does not know about whitespace tokens: it's the job of
72+
the next layer to assign whitespace and comments to nodes. However,
73+
parser can remap contextual tokens, like `>>` or `union`, so it has
74+
access to the text.
75+
76+
And at last, the TreeBuilder converts a flat stream of events into a
77+
tree structure. It also *should* be responsible for attaching comments
78+
and rebalancing the tree, but it does not do this yet :)
79+
80+
81+
## Error reporing
82+
83+
TODO: describe how stuff like `skip_to_first` works
84+
85+
86+
## Validator
87+
88+
Parser and lexer accept a lot of *invalid* code intentionally. The
89+
idea is to post-process the tree and to proper error reporting,
90+
literal conversion and quick-fix suggestions. There is no
91+
design/implementation for this yet.
92+
93+
94+
## AST
95+
96+
Nothing yet, see `AstNode` in `fall`.

0 commit comments

Comments
 (0)