# Modules API
## What is a module
A module:
- Provides a coherent public API
- Has internal details that are not intended to be directly accessed from outside the module
- Is a set of files covering the API (headers), implementations (headers and cpp files), and tests
### Submodules
TODO
## Why are we doing this?
Having a clear delineation between public and private APIs for each module will improve the
maintainability and velocity of our codebase. Teams will have more freedom to evolve their internal
implementation details without affecting consumers. Consumers will benefit from knowing what APIs
are intended for their consumption.
## Assigning files to modules
The file `modules_poc/modules.yaml` contains a list of modules, each containing a list of files.
Each file must be contained in only one module. Note that module assignment is not required to map
neatly to team ownership.
In cases where multiple globs match a file, the current rule is that the longest glob wins. This is
used as a simpler-to-implement version of most-specific glob wins, which we may switch to in the
future.
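As an illustration of the longest-glob-wins rule, here is a minimal Python sketch. The module names,
the globs, and the use of `fnmatch` semantics are assumptions made for the example; the real
`modules.yaml` schema and glob matching may differ.

```python
from fnmatch import fnmatchcase

# Hypothetical module -> globs mapping, not the real contents of modules.yaml.
MODULE_GLOBS = {
    "core": ["src/mongo/util/*", "src/mongo/base/*"],
    "query": ["src/mongo/db/query/*"],
    "storage": ["src/mongo/db/*"],
}


def assign_module(path, module_globs=MODULE_GLOBS):
    """Return the module owning the longest glob that matches `path`, or None."""
    best = None  # (glob length, module name) of the best match so far
    for module, globs in module_globs.items():
        for glob in globs:
            # fnmatch's `*` also matches `/`, so "src/mongo/db/*" covers subdirectories.
            if fnmatchcase(path, glob) and (best is None or len(glob) > best[0]):
                best = (len(glob), module)
    return best[1] if best else None


# Both "src/mongo/db/*" and "src/mongo/db/query/*" match; the longer glob wins.
print(assign_module("src/mongo/db/query/planner.h"))
```

Comparing glob lengths is cheaper to implement than reasoning about which glob is more specific,
which is the trade-off described above.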
## How do I mark API visibility?
This section will just describe the basic process. Later sections will cover the tooling available
to help, along with caveats to be aware of.
First read the documentation in
[src/mongo/util/modules.h](https://github.com/mongodb/mongo/blob/master/src/mongo/util/modules.h) for
the canonical list and description of visibility levels. As a brief overview of the main levels from
least to most restrictive:
- `OPEN`: This is available for usage _and inheritance_ from anywhere in the codebase
- `PUBLIC`: This is available for usage from anywhere in the codebase. For types, subclasses may
only be defined in the same module.
- `NEEDS_REPLACEMENT` and `USE_REPLACEMENT(...)`: These are collectively considered "unfortunately
public" and are available for use, but should be avoided
- `PARENT_PRIVATE`: This is similar to `PRIVATE`, but allows usage from any file in the parent
module, including other submodules
- `PRIVATE`: This may only be used from the current module or one of its submodules
- `FILE_PRIVATE`: This may only be used from the current "file family" (roughly, header + cpp +
tests). It may not be used by other files, even from the same module.
You can think of public vs private similarly to how you would the sections of a `class`: they
indicate whether something is intended to be part of the API or an implementation detail. The
difference is that they apply at a wider granularity of code than a single class, with
implementation details available to either the full module (and its submodules) for `PRIVATE` or the
file family for `FILE_PRIVATE`.
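The access rules above can be modeled roughly as follows. This is only a sketch for building
intuition, not how the tooling is implemented: the dotted `parent.child` naming for submodules, the
`same_file_family` flag, and the fallback of `PARENT_PRIVATE` to `PRIVATE` for top-level modules
are all assumptions. It also ignores the subclassing difference between `OPEN` and `PUBLIC`.

```python
def is_within(user_module, module):
    """True if user_module is `module` itself or one of its (dotted) submodules."""
    return user_module == module or user_module.startswith(module + ".")


def may_use(level, decl_module, user_module, same_file_family=False):
    """Rough model of whether code in user_module may use a declaration."""
    if level in ("OPEN", "PUBLIC", "NEEDS_REPLACEMENT", "USE_REPLACEMENT"):
        return True  # usable from anywhere (the last two are discouraged)
    if level == "PARENT_PRIVATE":
        # Assumption: a top-level module has no parent, so this acts like PRIVATE.
        parent = decl_module.rsplit(".", 1)[0]
        return is_within(user_module, parent)
    if level == "PRIVATE":
        return is_within(user_module, decl_module)
    if level == "FILE_PRIVATE":
        return same_file_family
    raise ValueError(f"unknown visibility level: {level}")


# A sibling submodule can reach PARENT_PRIVATE declarations but not PRIVATE ones.
print(may_use("PARENT_PRIVATE", "query.optimizer", "query.exec"))
print(may_use("PRIVATE", "query.optimizer", "query.exec"))
```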
The macros in that header file are attached to declarations and set the visibility level for that
declaration and all of its "semantic children"[^1]. The macros are C++ attributes which means that
they need to go in specific places that differ based on what is being marked (for templates, the
location does not change and is always somewhere after the `template <...>` part):
- `MONGO_MOD_PUBLIC;` by itself as the first line after includes in a header sets the default for
that header (only `PUBLIC`, `PARENT_PRIVATE`, and `FILE_PRIVATE` are allowed here)
- `namespace MONGO_MOD mongo {` (this does not work with nested namespaces in a single declaration
like `namespace mongo::repl`)
- `class MONGO_MOD Foo {` (Ditto for `enum`, `struct`, and `union`)
- `MONGO_MOD void func(...);`
- `MONGO_MOD int var;`
- `concept isFooable MONGO_MOD {`
For the cases where it goes at the beginning of the line, if clang-format chooses an unfortunate
place to break the line, it usually helps to undo the formatting then put the macro on its own line
above the declaration.
APIs are marked one header at a time, by including `"mongo/util/modules.h"` in the header. This
causes the header to be treated as "modularized" which has the following effects:
- All declarations in that header (not transitive includes) default to `PRIVATE`, meaning that the
public API is what must be marked.
- Members in `private:` sections in classes default to `PRIVATE`, regardless of the visibility of
the class. The only way the language would allow them to be used from outside of the module is if
you have cross-module friendships, which should generally be avoided. If needed temporarily, favor
`NEEDS_REPLACEMENT` over `PUBLIC` for these declarations.
- Declarations ending in `_forTest` default to `FILE_PRIVATE` to support the common case where they
are only intended for testing that class. If they are actually intended to support testing of
consumers, not just the type they are defined on, they can be explicitly given `PUBLIC` or
`PRIVATE` visibility.
- Internal and detail namespaces default to `PRIVATE` and cannot be made less restricted, but can
still be marked as `FILE_PRIVATE`. Individual declarations within the namespace can be exposed as
necessary, but they cannot be exposed in bulk without changing the name of the namespace to
something that doesn't imply private.
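One way to picture how these defaults combine is the sketch below. The function and its parameters
are hypothetical, and the order in which the rules are checked when several apply is an assumption;
`modules.h` and the scanner remain the authority.

```python
def effective_visibility(name, explicit=None, inherited=None,
                         in_private_section=False, in_internal_namespace=False):
    """Approximate the visibility of a declaration in a modularized header.

    explicit:  level from a marker on the declaration itself, if any
    inherited: level from the nearest semantic parent (or header default), if any
    """
    if explicit is not None:
        return explicit  # an explicit marker on the declaration wins
    if name.endswith("_forTest"):
        return "FILE_PRIVATE"  # test helpers default to the file family
    if in_private_section or in_internal_namespace:
        # Assumption: a stricter inherited FILE_PRIVATE is kept, otherwise PRIVATE.
        return "FILE_PRIVATE" if inherited == "FILE_PRIVATE" else "PRIVATE"
    return inherited or "PRIVATE"  # unmarked declarations default to PRIVATE


# A private member of a PUBLIC class does not inherit the class's PUBLIC marker.
print(effective_visibility("_helper", inherited="PUBLIC", in_private_section=True))
```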
For internal headers of a module which do not contribute to its public API, simply including
`modules.h` is sufficient. There is a [tool](#the-private-header-marker) to automate this process.
You may additionally want to consider whether any APIs should be marked `FILE_PRIVATE`, but that is
optional.
For IDL files, you mark visibility of whole types (`struct`, `enum`, and `command`) with the
`mod_visibility` option. The value should be the same as one of the `MONGO_MOD` macros, but
lowercase and without the prefix, for example `mod_visibility: public`. You can set the default
visibility for all types in that IDL file by putting that in the `global:` section. You cannot
control visibility of individual functions within the type. Please let us know if you have a
compelling use case for this.
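The naming rule for `mod_visibility` values can be expressed as a small helper. This is a
hypothetical function for illustration only; it assumes the `MONGO_MOD_` prefix seen in
`MONGO_MOD_PUBLIC` and deliberately rejects parameterized macros such as `USE_REPLACEMENT(...)`,
since how those are spelled in IDL is not covered here.

```python
def mod_visibility_value(macro):
    """Convert a parameterless macro name like MONGO_MOD_PUBLIC to its IDL value."""
    prefix = "MONGO_MOD_"
    if not macro.startswith(prefix) or "(" in macro:
        raise ValueError(f"not a simple visibility macro: {macro}")
    # Drop the prefix and lowercase the rest, e.g. FILE_PRIVATE -> file_private.
    return macro[len(prefix):].lower()


print("mod_visibility:", mod_visibility_value("MONGO_MOD_PUBLIC"))
```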
## What tooling exists to help me?
Note that all tooling should be run from within a properly set-up python virtual environment. This
includes running `buildscripts/poetry_sync.sh` to ensure you have the correct dependencies.
### The scanner and merger
The merger generates a cross reference of all first-party usages of first-party code and stores it
in `merged_decls.json`, which is used by the rest of our tooling. It is also where we validate that
there are no disallowed accesses. It will be invoked for you by the browser when you ask it to
rescan, or you can also manually run it as `modules_poc/merge_decls.py`. If you are interested in
analyzing that file, [`jq`](https://jqlang.org/) is a powerful tool, or you can just write some
python.
As a rather extreme example of what you can do with `jq`, here is how the progress reports are
generated:
```shell
# For each mod (and TOTAL):
#   For each file:
#     consider it marked if it has no UNKNOWNs
#   Compute a done percentage
#   Format to a nice string
jq 'map(., .mod = "TOTAL") | group_by(.mod)[] | group_by(.loc | split(":")[0]) | {mod: .[0].[0].mod, total: length, marked: map(select(any(.visibility == "UNKNOWN") | not)) | length} | .done = (1000 * .marked / .total | round) / 10 | "\(.mod): \(" " * (.mod | 40-length)) \(.done)% (\(.marked) / \(.total))"' -r merged_decls.json
```
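If you prefer Python, the same report can be sketched as below. The field names `mod`, `loc`, and
`visibility` are inferred from the `jq` command above rather than from a documented schema, so
treat them as assumptions and check them against your own `merged_decls.json`.

```python
import json
import math
from collections import defaultdict


def progress(decls):
    """Return {module: (marked_files, total_files, percent_done)}, including TOTAL."""
    # files[mod][path] stays True while no declaration in that file is UNKNOWN.
    files = defaultdict(dict)
    for decl in decls:
        path = decl["loc"].split(":")[0]  # loc looks like "path:line:col"
        known = decl["visibility"] != "UNKNOWN"
        for mod in (decl["mod"], "TOTAL"):
            files[mod][path] = files[mod].get(path, True) and known
    report = {}
    for mod, by_file in sorted(files.items()):
        marked = sum(by_file.values())
        total = len(by_file)
        # floor(x + 0.5) rounds halves up, like jq's round for non-negative values.
        done = math.floor(1000 * marked / total + 0.5) / 10
        report[mod] = (marked, total, done)
    return report


def print_report(path="merged_decls.json"):
    """Load the merger output and print one progress line per module."""
    with open(path) as f:
        for mod, (marked, total, done) in progress(json.load(f)).items():
            print(f"{mod}: {done}% ({marked} / {total})")
```

Call `print_report()` from the directory containing `merged_decls.json` after a scan.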
Internally, the merger will invoke `bazel build --config=mod-scanner //src/mongo/...` to run the
scanner over the whole codebase (or the parts that have changed since the last scan), taking
advantage of bazel remote execution to achieve very high levels of parallelism.
### The browser
The main piece of tooling to run is the browser, which is launched by running
`modules_poc/browse.py`. If you haven't scanned the codebase recently, it will offer to run it for
you which will take a few minutes. After modifying the source code, you can rescan at any time by
pressing `r`. It will only rescan files that have been modified or that transitively include
modified headers.
The browser is primarily intended to assist in labeling public APIs, so the files are sorted with
the largest number of unlabeled declarations ("unknowns") first. You can search for a file by
pressing `f` or press `m` to filter the files by module.
The list of available key bindings is shown on the right. You can toggle that by pressing `?`. Other
keybindings of note: press `g` to go to the currently highlighted declaration or location in your
editor (only when running in the vscode or nvim terminal), and `p` to toggle an inline preview of
the location within the browser. You can press `Tab ↹` to toggle between the tree and the code
preview. The mouse is fully supported for scrolling and expanding rows in the tree, and there are
aliases for some basic vim keybinds (`hjkl/`).
### The private header marker
Once you have scanned the codebase and produced a `merged_decls.json`,
`modules_poc/private_headers.py` can be used to find all header and IDL files where there are no
currently detected external usages and automatically mark them as fully private to the module. This
does not necessarily mean that all automatically marked headers are intended to be private. A human
should review to ensure that the marked headers match intent. You can pass flags to filter on
any/all of module, owning team, or path glob. For headers matching the filter, the script will also
warn of usages of `_forTest` external to the file family that may need to be marked `PRIVATE` to
make them available to the whole module since they default to only being available to the file
family for marked headers.
Make sure to run `buildscripts/clang_format.py format-my` or `bazel run format` after using it to
modify any C++ files.
Example usage:
```shell
./modules_poc/private_headers.py --team=server_programmability --module=core --glob="src/mongo/executor/*"
```
`--dry-run` can be added to view all of the changes without applying them.
### The PR comment generator
You can run `modules_poc/mod_diff.py` to output a brief summary of all of the API (including
visibility levels and usage counts) for each file modified in your branch. When putting up a PR to
mark API visibility, you should add a comment with its output to the PR as an aid to reviewers. The
output is intended to be close enough to C++ that you should put it in a ` ```cpp ` block when
making your PR comment to make it more readable. You can also pipe it through `bat -lcpp` to make it
colorful locally. Note that it will use the last scan output, so if you've modified any headers, you
should run a rescan prior to running this tool.
## Workflow
The workflow for each PR will generally be the same:
1. Ensure that you are in a python virtualenv, creating one if needed, and run
`buildscripts/poetry_sync.sh` to update python deps.
2. Run [the merger](#the-scanner-and-merger) to scan the code base: `modules_poc/merge_decls.py`
3. Mark some headers
4. Rerun the merger to ensure that there are no violations, and update `merged_decls.json`
5. Run [the PR comment generator](#the-pr-comment-generator) to show the APIs that you have marked
   - Look through this to ensure that everything is as you expect.
6. Put up a PR and include the generated comment in a ` ```cpp ` block
   - I suggest keeping PRs small (say, no more than 10 files at a time) so that they are manageable
     by reviewers. As an exception it seems reasonable to auto-mark many headers as private in a
     single PR, as long as those PRs are separate from those containing any manual marking.

When first starting to mark a module, I suggest running the
[`modules_poc/private_headers.py`](#the-private-header-marker) script with `--dry-run` (or `-n`) and
`--module=YOUR_MODULE`. For larger modules (in particular, the `query` mega module) you may want to
pass a `--glob` so that you can focus on a smaller subset of the code initially. That will give you
an overview of the files that are used from outside your module (which contain de facto public APIs
today) and those that are not (which can automatically be marked as private implementation details).
If all of the de facto private headers seem like they should be private, you can remove the dry-run
flag to have it automatically mark them as private. Be sure to validate that their contents are
actually intended to be private. Remember that the point of having a human doing the marking is to
ensure that we correctly capture intent. You can optionally mark implementation details within each
header as `FILE_PRIVATE`, if you would like to prevent them from being used elsewhere even within
the module.
You can then open [the browser](#the-browser) (`modules_poc/browse.py`) to look at the remaining
headers. It will show you what is used and from where. It will be particularly useful for things
that seem like they should be private, but are being used externally.
### What should I do when an internal API is currently being used?
1. If it is only used from a small number of external files, first check if those files should
   actually belong to your module. We tried to correctly map all files in phase 1, but some files
   may have been assigned to the wrong module. If that happens, try adjusting the globs in
   `modules_poc/modules.yaml` to move them.
2. If there is already a public API that callers should use instead, mark it as
   `USE_REPLACEMENT(better_api)`. The argument accepts any C++ tokens, but the intent is where
   possible to use the name of the replacement. This will generate a ticket for all teams using that
   code.
   1. If there are very few users, consider just cleaning them up.
3. Reconsider making this API public if other modules need its functionality, and this is the only
   way to get it.
4. Otherwise, if there is no public API that fulfills the needs of the callers, but you don't want
   the current API to remain public long-term, use `NEEDS_REPLACEMENT`. This will generate a ticket
   for the team that owns that code.
   1. If the API was "obviously" intended to be private (eg it is in a `details` namespace) and
      callers would be reasonably able to implement the functionality themselves, possibly by
      writing their own version, it seems acceptable to use
      `USE_REPLACEMENT(do not use internal details)`
## Caveats and Limitations
**OVERARCHING GUIDELINE**: Always try to mark declarations correctly according to intent, even if it
will not be enforced by the current tooling. This is both to provide the correct information to
human readers, as well as to avoid issues if we improve the tooling in the future to eliminate these
limitations.
The rest of this section is fairly technical and probably not necessary for most readers unless they
notice something "weird" going on and want to dive into why. Most of these limitations are more
likely to affect the core modules since most of the rest of our code does not expose APIs via macros
and templates or have APIs only consumed by templates, and those are where most of these issues come
up.
- We do not track usages of namespaces at all, only the declarations within namespaces. When a
namespace is marked with a visibility, it does not affect the visibility of the namespace itself
(since it doesn't have one), it sets the default visibility for all declarations within **that
namespace block**. Each time a namespace is reopened it is a separate block and the visibility
markers on other blocks of the same namespace do not apply.
- The scanner only knows about declarations that it sees being used. For implementation reasons, it
only discovers declarations by seeing what every usage is using. This can either cause or be
caused by other limitations.
- Usages in templates may not be seen. This is especially the case for "dependent types and values"
  which are things that are not known by the compiler before the template is instantiated.
  - This is a problem for functions where any arguments are dependent if it can't figure out which
    overload will be selected. It is even worse for free-functions called unqualified (`f(blah)`
    rather than `ns::f(blah)` or `x.f(blah)`) since due to ADL, overload resolution is _always_
    delayed for them.
- Everything that results from a macro expansion is treated as-if it was written at the point of
expansion. This applies to both declarations and usages. If you have an API that should only be
used via the defined macros, mark it as `PUBLIC_FOR_TECHNICAL_REASONS` to signal to readers that
they should avoid direct usage, even if the tooling won't prevent it. We may improve this in the
future.
- Template variables are completely ignored due to some unfortunate clang bugs. Still, try to mark
them correctly since we may change this in the future.
- Method calls are assigned to the static type at the call site. This has two important effects:
  - A subclass's overridden method may seem unused if it is only used via calls through a base class
    pointer/reference
  - Calls through a base class pointer/reference count as calls of that class's method, not of the
    interface's
- Defaulted members (methods, ctors, dtors) are treated as usages of the class itself, regardless of
  whether they are implicitly or explicitly defaulted. This is because clang does not provide an API
  to distinguish between those cases.
- Template normalization woes: we try really hard to report declarations as the template `foo<T>`
  rather than separate instantiations like `foo<int>`, `foo<string>`, etc, **unless** they are
  explicitly specialized, meaning that the instantiation has its own definition different from the
  main template. Unfortunately, clang does a bad job at this and we have a number of kludgy
  workarounds. The most important effects:
  - Explicit specializations of function and variable templates are ignored and always converted to
    the primary template.
  - We do treat explicit specializations of types as separate (using the heuristic of having a
    separate location than the main template), because they can have a different shape and API than
    the main template. In general they should probably have the same visibility though, unless the
    instantiation is using a private type which should be unavailable to consumers anyway.
  - Clang assigns many locations to the site of explicit template instantiations and extern template
    declarations, even when there is a better location that it can see. Luckily these are fairly
    rare.
- Sometimes clang reports the resolved destination of `using` declarations and type aliases, but
  usually it reports the `using` declaration itself. A few notable cases (these are trends and may
  not be absolute!):
  - `using Base::foo;` to expose a member of a base class is resolved as a usage of `Base::foo`
    rather than `Derived::foo`. This is especially notable when the `Base` class is intended to be a
    private implementation detail. You will need to mark all exposed methods as public.
  - `using Base::Base;` to pull in the base constructors is the opposite and is recorded as a usage
    of `Derived::Base(args)`, which is odd because such a declaration doesn't actually exist.
- Internal/details namespaces (currently defined as matching the regex `(detail|internal)s?$`)
have an implicit default visibility of `PRIVATE` if `modules.h` is included. It is not possible to
give the namespace a public visibility, but you can restrict it further with `FILE_PRIVATE`. If you
want declarations inside it to be usable from outside your module you must mark children of the
namespace explicitly, or rename it to not use a name that implies that it is for internal usage
only. A somewhat common case will be marking internal declarations that are only intended to be
used via macros with `PUBLIC_FOR_TECHNICAL_REASONS`.
- Be very careful with forward declarations. Try to avoid them wherever possible (unless there is a
  significant benefit). Especially avoid forward declaring anything from another module! Where
  forward declarations must be used, make sure that they have the same visibility as the real
  definition. As an exception, if every TU that sees the forward declaration will also see the
  definition it is OK to omit marking the forward declaration. This may happen when they are both in
  the same header, or the forward declaration is in a private implementation detail header which is
  included by the defining header. Be aware of the implicit visibility marking, which also applies
  to forward declarations if they are the only declaration seen in the TU.
  - Never forward declare functions to avoid including a header. They are much more problematic than
    types, both in general in C++ and specifically for this tooling.
- We try to use the definition location for types defined in headers, but the "canonical" location
(clang's term for the first declaration seen in the current TU) for everything else. If the type
is defined in a .cpp, we use the canonical location.
- We only consider declarations in headers, never in .cpp files.
- Be mindful of `_forTest` functions. They default to `FILE_PRIVATE` since they are typically
intended only for use when testing the type they are defined on, not when testing consumers. In
the cases where they _are_ intended as part of the API for testing consumers, you can explicitly
mark them `PUBLIC` or `PRIVATE` depending on whether they should be usable from outside your
module or not.
- Things used implicitly (eg implicit conversion operators) are still counted as usages even if they
  are not specifically named at the call site.
- When merging information from multiple TUs, definitions always replace the metadata gathered from
  TUs that only saw a declaration.
  - Note that we aren't guaranteed to see every definition, in particular for functions that are not
    called from the TU that they are defined in. So this cannot be used to find places where we
    deleted the definition but forgot to delete the declaration (we wouldn't see them anyway, since
    we only track things that are used, and undefined things can't really be used, except trivially,
    without breaking the build).
- `private` members of classes are implicitly `PRIVATE`, and must be explicitly marked otherwise if
desired. They should probably never be made `PUBLIC` since that implies cross-module friendship.
In the few places where we have that today, they have been made one of the flavors of
unfortunately public: `NEEDS_REPLACEMENT` or `USE_REPLACEMENT(...)`.
- `public` members of `private` types do not inherit the implicit `PRIVATE` and follow the normal
rule of looking for their nearest semantic parent with an explicit marker. That means that they
may be `PUBLIC`. However, the language rules still apply and as long as an instance of the type
is never handed to consumers they will have no way of accessing those members.
- `protected` members do not default to `PRIVATE`, but because we only allow subclassing from
`OPEN` classes, the language visibility rules will disallow access from outside the module
unless you choose to allow it by using `OPEN` classes or `friend`s. Note that making any subclass
`OPEN` exposes all `protected` members of parents unless they are marked `PRIVATE`.
- `friend` declarations are mostly ignored, except when they are a definition. So the definitions
using the "hidden friend" pattern are tracked, but we ignore it if the definition is in a cpp
file.
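To check whether a namespace name falls under the internal/details rule from the list above, you
can try the regex directly. In this sketch the pattern is copied from the text, but applying it to
the unqualified namespace name with an unanchored search is an assumption about how the tooling
uses it.

```python
import re

# Pattern quoted in the caveat above; only the end of the name is anchored.
INTERNAL_NAMESPACE = re.compile(r"(detail|internal)s?$")


def is_internal_namespace(name):
    """True if `name` ends in detail, details, internal, or internals."""
    return INTERNAL_NAMESPACE.search(name) is not None


for candidate in ["detail", "internals", "query_detail", "detailed", "repl"]:
    print(candidate, is_internal_namespace(candidate))
```

Because the pattern is not anchored at the start, a name such as `query_detail` also matches, while
`detailed` does not.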
[^1]:
Clang distinguishes between "semantic" and "lexical" parents. The primary differences are that
members of classes (including member types) are semantic children of the class even when defined
out of line, and conversely `friend` declarations are not, and instead are considered semantic
children of the nearest namespace.