Introduce per geometry and overall limits on number of expire tiles #2449
Force-pushed: 0552bda to af89bc8
        return m_tiles.empty();
    }

    quadkey_list_t expire_output_t::get_tiles()
Shouldn't that, strictly speaking, go into a mutex as well?
It isn't called anywhere that multiple threads are running, but you are right: it is safer to take the mutex there as well.
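For illustration, here is a minimal sketch of what guarding the accessors with the same mutex could look like. The class name `expire_output_sketch`, the `quadkey_list` alias and the member names are stand-ins invented for this example, not the actual osm2pgsql code, and whether the real `get_tiles()` copies or moves the list is an assumption.

```cpp
#include <algorithm>
#include <cstdint>
#include <mutex>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for the real quadkey list type.
using quadkey_list = std::vector<std::uint64_t>;

class expire_output_sketch
{
public:
    // May be called from several worker threads.
    void add_tile(std::uint64_t quadkey)
    {
        std::lock_guard<std::mutex> const guard{m_mutex};
        m_tiles.insert(quadkey);
    }

    bool empty() const
    {
        std::lock_guard<std::mutex> const guard{m_mutex};
        return m_tiles.empty();
    }

    // Returns a sorted copy of all tiles collected so far. Taking the lock
    // here costs little and keeps the accessor safe even if it is later
    // called while worker threads are still running.
    quadkey_list get_tiles() const
    {
        std::lock_guard<std::mutex> const guard{m_mutex};
        quadkey_list tiles(m_tiles.begin(), m_tiles.end());
        std::sort(tiles.begin(), tiles.end());
        return tiles;
    }

private:
    mutable std::mutex m_mutex; // mutable so const accessors can lock it
    std::unordered_set<std::uint64_t> m_tiles;
};
```

The mutex is declared `mutable` so that read-only accessors can stay `const` while still locking.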
    void expire_tiles_t::expire_tile(uint32_t x, uint32_t y)
    {
        // Only try to insert to tile into the set if the last inserted tile
        // is different from this tile.
Any particular reason to drop this optimisation?
We don't have any benchmarks that prove that it makes sense, so it might have been premature. And it complicated the code. It is also unclear whether this should now be done for tiles inside a single geometry or for tiles "shared" between geometries which means there are different places where this could go. I opted to remove it here and maybe bring it back after all the other refactoring of the expire code is done if it seems useful then.
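For readers unfamiliar with the optimisation under discussion, the following is a rough sketch of the idea: remember the last tile and skip the set insertion when the same tile comes in again. The class `tile_collector_sketch`, the key packing and the return value are all invented for this illustration and do not reproduce the removed osm2pgsql code.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <unordered_set>

class tile_collector_sketch
{
public:
    // Returns true if a set insertion was attempted, false if it was
    // skipped because the tile equals the previously seen tile.
    bool expire_tile(std::uint32_t x, std::uint32_t y)
    {
        // Pack x and y into one 64-bit key (illustrative, not a quadkey).
        std::uint64_t const key =
            (static_cast<std::uint64_t>(x) << 32U) | y;

        if (m_last && *m_last == key) {
            return false; // same tile as last time: avoid the hash lookup
        }
        m_last = key;
        m_tiles.insert(key); // duplicates of older tiles are dropped here
        return true;
    }

    std::size_t size() const noexcept { return m_tiles.size(); }

private:
    std::optional<std::uint64_t> m_last;
    std::unordered_set<std::uint64_t> m_tiles;
};
```

The check only saves a hash lookup for consecutive repeats; the set still deduplicates everything else, which is why its benefit is hard to judge without benchmarks.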
src/osm2pgsql-expire.cpp (outdated)

    if (cfg.zoom == 0) {
        throw std::runtime_error{
            "You have to set the zoom level ith -z, --zoom"};
ith -> with
done
The number of tiles to be expired can be quite large if the input geometries are large or if there are many geometries. Numbers of tiles in the billions can crash osm2pgsql because it runs out of memory. Such large numbers can also overwhelm any kind of re-rendering mechanism run after osm2pgsql to bring tiles up to date. In day-to-day processing this should not happen, but it can happen due to vandalism or misconfiguration.

To protect against this problem, this change introduces limits on the number of tiles that can be affected by a single geometry and on the overall number of tiles that an expire output will generate for each run of osm2pgsql.

* If a single geometry would result in the expiry of more than `max_tiles_geometry` tiles, this geometry will be ignored for the purposes of expiry. Note that the geometry will still be written to the database, but no tiles will be added to the expire output.
* If the number of tiles generated during a single run of osm2pgsql for an expire output grows beyond `max_tiles_overall`, no further tiles will be written to this output.

Limits are per expire output, of which you can have several. The limits can be set in the flex expire output configuration, but sensible defaults are provided. For the (legacy) expire output configured on the command line with the `-e` and `-o` options, the settings cannot be changed; you will always get the default values.

To choose the default values for these settings I looked at real-world values as follows:

* Russia has one of the largest boundaries on the planet. Expiry (boundary only) on zoom level 14 affects 94144 tiles, on z15 190168 tiles, on z16 383465 tiles. For typical raster tiles using 8x8 meta tiles, expiry on z16 is equivalent to showing z19 tiles. So 500,000 tiles seems to be a useful limit for `max_tiles_geometry`.
* For expiring the area I looked at the Greenland icesheet, which needs more than 8 million tiles on z14. At least for vector tiles this is good enough; for raster tiles we might need more though.
* For `max_tiles_overall`: Paul Norman analyzed the number of tiles expired by typical minutely updates in https://www.openstreetmap.org/user/pnorman/diary/403266. For zoom level 14 the most he got was 119801 tiles. The same analysis also shows that for longer time frames (2 minutes and 5 minutes were checked, but the same should be true for larger intervals) the number of tiles doesn't go up, because these huge numbers only happen very rarely.

Rounding these numbers and adding a safety factor, values of 10,000,000 and 50,000,000 seem reasonable for the single geometry and the overall number of tiles per run. Memory use in osm2pgsql is about 32 bytes per tile, so this will need 1.6 GB max, which should be no problem at all.

The numbers are chosen so that they will practically never be triggered and users upgrading from existing versions of osm2pgsql will not be suddenly affected. It is recommended that users tune their settings according to their own needs. Once we have some more operational experience with this, we can adjust the defaults.

I considered using different default max values for different zoom levels, but this would make configuration more complicated.

Change file processing in osm2pgsql runs in parallel threads. The old code stored the to-be-expired tiles in one list per thread and merged them later. This has two problems:

a) because the lists might contain some of the same tiles, all lists together can use much more memory than a single list would take
b) we cannot easily check the number of tiles in those lists against the configured maximum.

So this commit changes the way the list is kept: We only keep a single list in the expire_output_t and use a mutex to control access to this list. (There might still be overlapping lists if you have more than one expire output, but that's by design.)

Objects of expire_tiles_t class now only keep a temporary list for each geometry added. Once all tiles affected by a single geometry are identified, this list is added to the overall list in expire_output_t and the temporary list is cleared.

Fixes osm2pgsql-dev#2190
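The design described in the commit message can be sketched roughly as follows: one shared, mutex-protected set per expire output, into which each geometry's temporary tile list is merged subject to the two limits. All names here (`expire_limits_sketch`, `limited_output_sketch`, `add_geometry_tiles`, `estimated_bytes`) are hypothetical, and the exact point at which the real code checks `max_tiles_overall` is an assumption; only the limit names, the default values and the 32 bytes per tile figure come from the text above.

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <unordered_set>
#include <vector>

// Defaults as stated in the commit message.
struct expire_limits_sketch
{
    std::size_t max_tiles_geometry = 10'000'000;
    std::size_t max_tiles_overall = 50'000'000;
};

// Rough memory estimate, using the "about 32 bytes per tile" figure.
constexpr std::uint64_t estimated_bytes(std::uint64_t tiles) noexcept
{
    return tiles * 32U;
}

class limited_output_sketch
{
public:
    explicit limited_output_sketch(expire_limits_sketch limits)
    : m_limits(limits)
    {}

    // Merge the temporary tile list of one geometry into the shared set.
    // Returns the number of tiles that were newly added.
    std::size_t add_geometry_tiles(std::vector<std::uint64_t> const &tiles)
    {
        if (tiles.size() > m_limits.max_tiles_geometry) {
            return 0; // geometry too large: ignored for expiry
        }

        std::lock_guard<std::mutex> const guard{m_mutex};
        std::size_t added = 0;
        for (auto const tile : tiles) {
            if (m_tiles.size() >= m_limits.max_tiles_overall) {
                break; // overall limit reached: write no further tiles
            }
            if (m_tiles.insert(tile).second) {
                ++added;
            }
        }
        return added;
    }

    std::size_t size() const
    {
        std::lock_guard<std::mutex> const guard{m_mutex};
        return m_tiles.size();
    }

private:
    expire_limits_sketch m_limits;
    mutable std::mutex m_mutex;
    std::unordered_set<std::uint64_t> m_tiles;
};
```

With the default overall limit, `estimated_bytes(50'000'000)` evaluates to 1,600,000,000 bytes, which matches the 1.6 GB figure quoted above. Because the per-geometry check happens before the lock is taken, an oversized geometry never contends for the mutex.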
Force-pushed: af89bc8 to 45e330c