[ENH] V1 → V2 API Migration - core structure #1576

Conversation
Codecov Report ❌ Patch coverage

@@ Coverage Diff @@
##             main    #1576      +/-   ##
==========================================
- Coverage   52.75%   49.68%   -3.08%
==========================================
  Files          36       46      +10
  Lines        4333     4567     +234
==========================================
- Hits         2286     2269      -17
- Misses       2047     2298     +251
    cache: CacheConfig
    ...
    settings = Settings(
I would move the settings to the individual classes. This design couples the classes too tightly to this file: you cannot move the classes around, or add a new API version, without making non-extensible changes here, because APISettings would require a constructor change for every new class it accepts.
Instead, a better design is to apply the strategy pattern cleanly to the different API definitions - v1 and v2 - and move the config either into their __init__, or into a set_config (or similar) method.
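A minimal sketch of what this could look like. The names (APIConfig, APIStrategy, V1API, V2API, build_api) are illustrative, not the PR's actual classes; the point is only that each strategy owns its config, so adding a v3 later touches no central settings class:

```python
from dataclasses import dataclass


@dataclass
class APIConfig:
    """Connection settings owned by each strategy (names are illustrative)."""
    server: str
    key: str = ""


class APIStrategy:
    """Common interface for versioned API definitions (strategy pattern)."""

    def __init__(self, config: APIConfig) -> None:
        # Each strategy receives its own config in __init__, so no
        # central class needs to know about every version.
        self.config = config


class V1API(APIStrategy):
    version = "v1"


class V2API(APIStrategy):
    version = "v2"


def build_api(version: str, config: APIConfig) -> APIStrategy:
    # Registry lookup keeps the factory open for extension: a new
    # version only needs to be added to this tuple.
    registry = {cls.version: cls for cls in (V1API, V2API)}
    return registry[version](config)
```

With this shape, the config lives with the class instance rather than in a shared config.py, which is the decoupling the comment above argues for.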
fkiraly left a comment:
Overall really great; I have a design suggestion related to the configs.
The config.py file, and the coupling on it, breaks an otherwise nice strategy pattern.
I recommend following the strategy pattern cleanly instead, and moving the configs into the class instances, see above.
This will make the backend API much more extensible and cohesive.
    key="...",
    ),
    v2=APIConfig(
        server="http://127.0.0.1:8001/",
Should this be hardcoded? I guess this is just for your local development.
This is hard-coded; these are the default values, though the local endpoints will be replaced by the remote server when deployed, hopefully before merging this into main.
    if strict:
        return v2
In a previous commit the 'FallbackProxy' was used here. Do we still need this class?
I removed this because of the ruff errors. I'll put them back and fix the pre-commit when the class is implemented.
Overall I agree with the suggested changes. This seems like a reasonable way to provide a unified interface for two different backends, and also separate out some concerns that were previously coupled or scattered more than they should (e.g., caching, configurations).
My main concern is with the change to caching behavior. I also have a minor concern about the indirection APIContext introduces (perhaps I misunderstood its purpose), and about newly allowing Response return values.
In my comments you will also find some things that may already have been "wrong" in the old implementation. In those cases, I think it simply makes sense to make the change now, so I repeat them here for convenience.
openml/_api/http/client.py (Outdated)

    from openml._api.config import APIConfig
    ...
    class CacheMixin:
The TTL should probably depend heavily on the path. If we do end up using caching at this level, we should use the Cache-Control HTTP response header so the server can inform us how long to keep an entry in cache (something that, I believe, neither server does right now). A dataset search query can change if any dataset description changes (to either be now included or excluded), so caching probably shouldn't even be on by default for that type of query. Dataset descriptions might change, but likely not very frequently. Dataset data files or computed qualities should (almost?) never change. This is the reason the current implementation only caches the description, features, qualities, and the dataset itself.
With this implementation, you also introduce some new issues:
- What if the paths change, or even the query parameters? There is now dead cache. Do we then add cache cleanup routines? How does openml-python know what is no longer valid if the responses had a high TTL?
- URLs may be (much) longer than the default max path of Windows (260 characters). If I'm not mistaken, this will lead to an issue unless you specifically work around it.
- More of an implementation detail, but authenticated and unauthenticated requests are not differentiated. If a user accidentally makes an unauthenticated request, gets an error, and then authenticates they would still get an error.
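A sketch of how the last two points could be addressed; the function names (cache_path, ttl_from_cache_control) are hypothetical, not part of the PR. Hashing the URL keeps cache file names well under the Windows 260-character path limit, folding the auth state into the key separates authenticated from unauthenticated responses, and the TTL is read from Cache-Control if the server ever sends one:

```python
import hashlib


def cache_path(url: str, authenticated: bool) -> str:
    """Build a short, collision-resistant cache file name for a request.

    Hashing keeps the name short regardless of URL length, and folding in
    the auth state keeps authenticated and unauthenticated responses in
    separate cache entries.
    """
    scope = "auth" if authenticated else "anon"
    digest = hashlib.sha256(f"{scope}:{url}".encode()).hexdigest()
    return f"{scope}-{digest}.cache"


def ttl_from_cache_control(header: str, default: int = 0) -> int:
    """Derive a TTL (seconds) from a Cache-Control response header.

    Falls back to `default` when no max-age directive is present, so
    caching stays off unless the server opts in.
    """
    for directive in header.split(","):
        directive = directive.strip()
        if directive.startswith("max-age="):
            try:
                return int(directive.split("=", 1)[1])
            except ValueError:
                return default
    return default
```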
    tasks=TasksV1(v1_http),
    )
    ...
    if version == "v1":
nit: supported versions should be encoded in an Enum. This helps function signatures (type checking, code completion) and reduces the chance of erroneous input.
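For illustration, a str-backed Enum keeps the comparison against "v1"/"v2" working while giving the factory a typed signature (APIVersion and the build_backend body here are assumptions, not the PR's code):

```python
from enum import Enum


class APIVersion(str, Enum):
    """Supported API versions (illustrative)."""
    V1 = "v1"
    V2 = "v2"


def build_backend(version: APIVersion) -> str:
    # Coercing through the Enum also accepts plain "v1"/"v2" strings,
    # while any other string fails loudly with a ValueError here,
    # at the boundary, instead of deep inside the backend.
    version = APIVersion(version)
    return f"backend for {version.value}"
```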
    from openml._api.runtime.core import APIContext
    ...
    def set_api_version(version: str, *, strict: bool = False) -> None:
        api_context.set_version(version=version, strict=strict)
    ...
    api_context = APIContext()
It's not clear what the function of the APIContext is here. Why do we need it, rather than just using the backend directly? E.g.:
Suggested change:

    - from openml._api.runtime.core import APIContext
    -
    - def set_api_version(version: str, *, strict: bool = False) -> None:
    -     api_context.set_version(version=version, strict=strict)
    -
    - api_context = APIContext()
    + from openml._api.runtime.core import build_backend
    +
    + _backend = build_backend("v1", strict=False)
    +
    + def set_api_version(version: str, *, strict: bool = False) -> None:
    +     global _backend
    +     _backend = build_backend(version=version, strict=strict)
    +
    + def backend() -> APIBackend:
    +     return _backend
If it is just to avoid the pitfall where users assign the returned value to a local variable whose scope is too long-lived, then the same would apply if users assigned api_context.backend to a variable. We could instead extend the APIBackend class to allow updates to its attributes?
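The last alternative mentioned above could look roughly like this (a sketch only; the real APIBackend presumably holds clients, not a bare version string). Mutating one shared instance in place means any reference a user already holds keeps seeing the current version, which rebinding a module-level variable would not guarantee:

```python
class APIBackend:
    """Sketch of a backend whose version can be swapped in place."""

    def __init__(self, version: str) -> None:
        self._version = version

    @property
    def version(self) -> str:
        return self._version

    def set_version(self, version: str) -> None:
        # Updating the attribute on the existing object avoids the
        # stale-reference pitfall of replacing a module-level singleton.
        self._version = version


backend = APIBackend("v1")
```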
I agree with you; it's not really useful. I am going to iterate on the design and will keep this in mind.
    server: str
    base_url: str
    key: str
    timeout: int = 10  # seconds
nit: Add a unit suffix (timeout_seconds) so the unit is clear without navigating to the source.
P.S. I also considered typing it as datetime.timedelta, but since you probably only use it in seconds, and there is a real risk of developers erroneously using datetime.timedelta.seconds instead of datetime.timedelta.total_seconds(), I think keeping it an integer is better.
makes sense
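The timedelta pitfall the reviewer alludes to is easy to demonstrate: `.seconds` is only the seconds component of the duration (0-86399), not its total length.

```python
from datetime import timedelta

# A duration longer than a day: .seconds silently drops the days part.
span = timedelta(days=1, hours=2)

component = span.seconds       # only the 2-hour component: 7200
total = span.total_seconds()   # the full duration: 93600.0
```

Keeping the field a plain integer of seconds, as suggested, removes the opportunity for this mistake entirely.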
    class ConnectionConfig:
        retries: int = 3
        delay_method: DelayMethod = "human"
        delay_time: int = 1  # seconds
nit: here too, including the unit makes sense (delay_time_seconds)
    @dataclass
    class CacheConfig:
        dir: str = "~/.openml/cache"
        ttl: int = 60 * 60 * 24 * 7  # one week
nit: Since the TTL in the HTTP standard is already defined in seconds, maybe it is fine to omit the unit from the variable name? Though, as noted above, there is a discussion to be had about having this as a cache-level property in the first place.
For future reference, setting the value to timedelta(weeks=1).total_seconds() is preferable to the arithmetic plus comment.
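Concretely, the suggested default could be written as below (a sketch mirroring the snippet under review; total_seconds() returns a float, so a cast keeps the field a plain integer count of seconds):

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class CacheConfig:
    dir: str = "~/.openml/cache"
    # One week, derived from timedelta instead of 60 * 60 * 24 * 7,
    # so the intent needs no comment.
    ttl: int = int(timedelta(weeks=1).total_seconds())
```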
    @dataclass
    class CacheConfig:
        dir: str = "~/.openml/cache"
The default should continue to respect XDG_CACHE_HOME.
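One way to honor that, sketched here with a hypothetical helper (default_cache_dir and the "openml" subdirectory name are assumptions): per the XDG Base Directory specification, use $XDG_CACHE_HOME when set, otherwise fall back to ~/.cache.

```python
import os
from pathlib import Path


def default_cache_dir() -> Path:
    """Cache root that respects XDG_CACHE_HOME, per the XDG spec."""
    base = os.environ.get("XDG_CACHE_HOME")
    # Spec fallback when the variable is unset or empty: ~/.cache
    root = Path(base) if base else Path.home() / ".cache"
    return root / "openml"
```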
Towards #1575
This PR sets up the core folder and file structure along with base scaffolding for the API v1 → v2 migration.
It includes versioned scaffolding classes (*V1, *V2). No functional endpoints are migrated yet. This PR establishes a stable foundation for subsequent migration and refactor work.