Merged
33 commits
5b34035
Remove v1 result file code paths
mawelborn Jun 4, 2025
052c37f
Remove v1 code paths and load tables automatically when available
mawelborn Jun 6, 2025
3e9403b
Update unit tests with 7.2 ETL Output JSON files
mawelborn Jun 6, 2025
f5f449d
Update docstrings
mawelborn Jun 6, 2025
bb863bf
Add more unit tests
mawelborn Jun 6, 2025
0cbc302
Remove pre-7.2 normalization and refactor normalization
mawelborn Jun 4, 2025
f84c069
Rename `model_sections` and `component_sections` to `*_ids`
mawelborn Jun 4, 2025
88f025a
Rename `ModelGroup` and `ModelGroupType` to `Task` and `TaskType`
mawelborn Jun 4, 2025
75c4005
Refactor `Result.from_dict()`
mawelborn Jun 4, 2025
edd6d1c
Add `Box.__bool__()` to match `Span` and `Citation`
mawelborn Jun 6, 2025
24123cb
Add `Summarization.spans` property
mawelborn Jun 6, 2025
374f0ad
Update prediction docstrings and simplify a few statements
mawelborn Jun 6, 2025
6c7196a
Add more unit tests
mawelborn Jun 6, 2025
3230abc
Explicitly handle null spans and citations
mawelborn Jun 9, 2025
a2417da
Fix semantic error in setting form extraction checkbox text
mawelborn Jun 9, 2025
f7c5242
Filter form extractions by type for `.where(checked=...)` and `.where…
mawelborn Jun 9, 2025
2123023
Remove `Result.version` as it's no longer needed to serialize
mawelborn Jun 10, 2025
1ceb7ca
Improve variable names for `PredictionList.groupby()` and `.groupbyit…
mawelborn Jun 10, 2025
58b1581
Improve docstring and fix predicate order of `PredictionList.where()`
mawelborn Jun 10, 2025
2570bca
Make `NULL_CITATION` initializer consistent with the dataclass defini…
mawelborn Jun 10, 2025
75966cc
Make `Citation.to_dict()` key order consistent with the dataclass def…
mawelborn Jun 10, 2025
2da72f3
Update `normalize_prediction_dict()` comment
mawelborn Jun 10, 2025
d2af0a1
Improve variable names
mawelborn Jun 11, 2025
bc3182a
Move return out of try..except
mawelborn Jun 12, 2025
c37a6f8
Fix importing `ResultError` as a type
mawelborn Jun 16, 2025
d51ee96
Improve comments
mawelborn Jun 16, 2025
45bac96
Merge pull request #216 from IndicoDataSolutions/mawelborn/results-7-2
mawelborn Jun 16, 2025
1feb492
Merge pull request #217 from IndicoDataSolutions/mawelborn/etloutput-7-2
mawelborn Jun 16, 2025
f16c925
Unify Result and EtlOutput loading APIs
mawelborn Jun 16, 2025
9d86247
Update changelog and version
mawelborn Jun 17, 2025
ae6205e
Rewrite changelog to follow keepachangelog.com
mawelborn Jun 17, 2025
0ec8e7c
Include lists of tokens/tables in ETL output loadable types
mawelborn Jun 18, 2025
8715fe3
Merge pull request #218 from IndicoDataSolutions/mawelborn/dataclass-…
mawelborn Jun 18, 2025
265 changes: 179 additions & 86 deletions CHANGELOG.md
@@ -1,168 +1,261 @@
# Changelog

## 1.0.1 - 6/2/2021
All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and versions match the minimum IPA version required to use functionality.


## [v7.2.0] - 2025-06-17

### Added

* Added Snapshot merging / manipulation
* Class for highlighting extractions onto source PDFs and adding table of contents.
- Unified support for `dict`, JSON `str`, and JSON `bytes` as loadable types in
`results.load()`, `results.load_async()`, `etloutput.load()`, and
`etloutput.load_async()`.

### Fixed
### Changed

* Row Association now also sorting on 'bbtop'
- Rename `model`, `ModelGroup`, and `ModelGroupType` to `task`, `Task`, and `TaskType`
in the `results` module.
- Table OCR is automatically loaded when present
(`AutoReviewPoller(..., load_tables=True)` and `etloutput.load(..., tables=True)` are
now the default).

## 1.0.2 6/15/2021
### Removed

- v1 result file support and code paths in the `results` module.
- v1 ETL output support and code paths in the `etloutput` module.
- IPA 6.X support and edge cases in the `results` and `etloutput` modules.


## [v6.14.2] - 2025-05-08

### Added

* PDF manipulation features
* Support for classification predictions
- Support for imported models using IPA 7.2 `component_metadata` section.
- Parse and preserve full span information for `Unbundling` predictions.
- `group = next(group)` idiom.

### Removed

- `AutoPopulator`, `CustomOcr`, `Datasets`, `DocExtraction`, `Reviewer` classes.

### Fixed

* Dependency installation
- Mypy configuration.


## [v6.14.1] - 2025-03-20

## 1.0.3 8/16/2021
### Changed

- Improve Poetry and Poe configuration.
- Update more attributes when prediction text changes to avoid TAK normalization issues.


## [v6.14.0] - 2025-03-10

### Added

* Find from questionnaire ID added to finder class.
* ModelGroupPredict support added.
* Added module to get metrics for all models in a group.
* Multi color highlighting and annotations added for PDF highlighting.
* Added stagger dataset upload feature for large doc datasets.
* Added default retry functionality for certain API calls.
* Added additional snapshot features.
- `results` module.
- `etloutput` module.
- Async coroutine support in the `retry` decorator.

### Fixed
### Changed

* Wait kwarg added to submit review method.
* Better support for dataset creation / adding files to teach tasks.
- Switch to Poetry for packaging and dependency management.

## 1.0.5 9/21/2021

## v6.1.0 - 2024-05-06

### Removed

- Staggered loop support.
- Highlighting support.


## v6.0.1 - 2023-11-22

### Added

- Original filename to the workflow result object.


## v6.0.0 - 2023-10-30

This is the first major version release tested to work on Indico 6.X.

### Added

* Classes for model comparison and model improvement.
* Plotting functionality for model comparison.
- Support for unbundling metrics.
- `Structure` class to support building out workflows, datasets, teach tasks, and
copying workflows.

### Changed

- Refactor `AutoReview` to simplify setup.
- Replace `AutoClassifier` with `AutoPopulator` to make on-document classification
model training simple. This class also includes a "copy_teach_task" method that is a
frequently needed standalone method.
- Simplify `StaggeredLoop` implementation to inject labeled samples into a development
workflow (deprecated previous version).


## 1.0.7 11/9/2021
## v2.0.2 - 2022-08-31

### Added

* Positioning class added to assist in relative prediction location validation
* Added # of samples labeled to metrics class.
- Support for staggered looped learning.
- Ground truth compare feature to compare a snapshot against model predictions and
receive analytics.

### Changed

- Upgrade client to 5.1.4.
- Modify `IndioWrapper` class to work with Indico 5.x.
- Update `Snapshot` class to account for updated target spans.
- Update Add Model calls to align with 5.1.4 components.


## v2.0.1 - 2022-05-20

### Added

- Feature in `FileProcessing` class to read and return file as JSON.
- Feature in `Highlighter` class to redact and replace highlights with spoofed data.
- `Download` class to support downloading resources from an Indico Cluster.

### Changed

- Upgrade client to 5.1.3.
- Update SDK calls for Indico 5.x compatibility.

### Removed

* Teach classes in indico_wrapper
- `FindRelated` class in `indico_wrapper`.


## 1.0.8 11/15/2021
## v1.2.2 - 2022-03-03

### Added

* New line plot for number of samples in metrics class.
* Update to highlighting class with new flexibility and bookmarks replacing table of contents.
- Feature in `Positioning` class to calculate overlap between two bounding boxes on the
same page.

### Changed

- Update metrics plot to order ascending based on latest model.

### Fixed

- Optional dependencies to support M1 installation.


## 1.1.1 12/6/2021
## v1.2.0 - 2022-01-06

### Added

* Abillity to include metadata with highlighter
* Ability to split large snapshots into smaller files
- Distance measurements in the prediction `Positioning` class.
- Features on the `Extractions` class:
- predictions that are removed by any method are saved in an attribute if they're
needed for logs, etc.; get all text values for a particular label,
- get most common text value for a particular label.
- Better exception handling for workflow submissions and more flexibility on format of
what is returned (allows custom response JSON to avoid the `WorkflowResult` class).

## 1.1.2 12/6/2021

## v1.1.2 - 2021-12-06

### Added

* Updated functionality for large dataset creation. Batch options allow for more reliable dataset uploads.
- Update functionality for large dataset creation.
- Batch options allow for more reliable dataset uploads.


## 1.2 1/6/2022
## v1.1.1 - 2021-12-06

### Added

* New distance measurements in the prediction Positioning class.
* New Features on the Extractions class: predictions that are removed by any method are saved in an
attribute if they're needed for logs, etc.; get all text values for a particular label; get most
common text value for a particular label.
* Better exception handling for Workflow submissions and more flexibility on format of what is returned
(allows custom response jsons to avoid the WorkflowResult class).
- Ability to include metadata with highlighter.
- Ability to split large snapshots into smaller files.

## 1.2.2 3/03/2022

## v1.0.8 - 2021-11-15

### Added

* Updated metrics plot to order ascending based on latest model
* New feature in Positioning class to calculate overlap between two bounding boxes on the same page
- Line plot for number of samples in metrics class.

### Fixed
### Changed

* Optional dependencies to support M1 installation
- Update `Highlighting` class with new flexibility and bookmarks replacing table of
contents.

## 2.0.1 5/20/2022

## v1.0.7 - 2021-11-09

### Added

* New feature in FileProcessing class to read and return file as json
* New feature in Highlighter class to redact and replace highlights with spoofed data
* New Download class to support downloading resources from an Indico Cluster
* Upgrades client to 5.1.3 and upgrades SDK calls for Indico 5.x compatibility
- `Positioning` class to assist in relative prediction location validation.
- Number of samples labeled to metrics class.

### Removed

* FindRelated class in indico_wrapper
- Teach classes in `indico_wrapper`.


## 2.0.2 8/31/2022
## v1.0.5 - 2021-09-21

### Added

* Upgrades client to 5.1.4
* New feature to now support staggered looped learning
* Ground truth compare feature to compare a snapshot against model predictions and receive analytics
* Modifies IndioWrapper class to updated CreateModelGroup call to work with Indico 5.x
* Updates Snapshot class to account for updated target spans
* Updates Add Model calls to aligh with 5.1.4 components
- Classes for model comparison and model improvement.
- Plotting functionality for model comparison.

## 6.0 10/30/23

This is the first major version release tested to work on Indico 6.X.
## v1.0.3 - 2021-08-16

### Added

* Refactored AutoReview to simplify setup.
* Replaced AutoClassifier with AutoPopulator to make ondoc classification model training simple. This class also includes a "copy_teach_task" method that is a frequently needed standalone method.
* Simplified a StaggeredLoop implementation to inject labeled samples into a dev workflow (deprecated previous version).
* Added support for unbundling metrics.
* Added the `Strucure` class to support building out workflows, datasets, teach tasks. As well as to support copying workflows.
- Find from questionnaire ID added to finder class.
- ModelGroupPredict support.
- Module to get metrics for all models in a group.
- Multi color highlighting and annotations for PDF highlighting.
- Staggered dataset upload feature for large doc datasets.
- Default retry functionality for certain API calls.
- Additional snapshot features.

### Fixed

- `wait` keyword argument added to submit review method.
- Better support for dataset creation / adding files to teach tasks.

## 6.0.1 11/22/23

## v1.0.2 - 2021-06-15

### Added

* Small but important fix to add original filename to the workflow result object
- PDF manipulation features.
- Support for classification predictions.

### Fixed

## 6.1.0 5/6/24
- Dependency installation.

### Removed

* Removed staggered loop support and removed highlighting support.
## v1.0.1 - 2021-06-02

## 6.14.0 3/10/25
### Added

* Added `results` module.
* Added `etloutput` module.
* Refactored `retry` decorator with asyncio support.
* Switched to Poetry for packaging and dependency management.
- Snapshot merging / manipulation.
- Class for highlighting extractions onto source PDFs and adding table of contents.

## 6.14.1 3/20/25
### Fixed

* Improved Poetry and Poe configuration.
* Update more attributes when prediction text changes to avoid TAK normalization issues.
- Row Association now also sorting on 'bbtop'.

## 6.14.2 5/8/25

* Fixed Mypy configuration.
* Removed `AutoPopulator`, `CustomOcr`, `Datasets`, `DocExtraction`, `Reviewer` classes.
* Added support for imported models using IPA 7.2 `component_metadata` section.
* Parse and preserve full span information for `Unbundling` predictions.
* Add `group = next(group)` idiom.
[v7.2.0]: https://github.com/IndicoDataSolutions/zapper/compare/v6.14.2...v7.2.0
[v6.14.2]: https://github.com/IndicoDataSolutions/zapper/compare/v6.14.1...v6.14.2
[v6.14.1]: https://github.com/IndicoDataSolutions/zapper/compare/v6.14.0...v6.14.1
[v6.14.0]: https://github.com/IndicoDataSolutions/zapper/tree/v6.14.0
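
The v7.2.0 entry above lists `dict`, JSON `str`, and JSON `bytes` as loadable types for `results.load()` and `etloutput.load()`. As a rough illustration of what accepting all three can look like, here is a minimal standalone sketch. It is not the toolkit's implementation: `to_dict` and `Loadable` are hypothetical names introduced only for this example, and it relies solely on the standard library's `json.loads()`, which accepts both `str` and `bytes`.

```python
import json
from typing import Any, Dict, Union

# Hypothetical alias for illustration; not a name taken from the toolkit.
Loadable = Union[Dict[str, Any], str, bytes]


def to_dict(loadable: Loadable) -> Dict[str, Any]:
    """Normalize a dict, JSON str, or JSON bytes into a plain dict."""
    if isinstance(loadable, dict):
        # Already parsed; return it unchanged.
        return loadable

    if isinstance(loadable, (str, bytes)):
        # json.loads() handles both text and UTF-8 encoded bytes.
        parsed = json.loads(loadable)
        if not isinstance(parsed, dict):
            raise ValueError("expected a JSON object at the top level")
        return parsed

    raise TypeError(f"unsupported loadable type: {type(loadable).__name__}")
```

A caller can then pass whichever form it already has, for example `to_dict(path.read_bytes())` or `to_dict(response_text)`, without converting first.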
4 changes: 2 additions & 2 deletions examples/results_autoreview.py
@@ -20,8 +20,8 @@ def autoreview(result: results.Result) -> Any:
     pre_review = result.pre_review
     extractions = pre_review.extractions

-    # Downselect all labels from all models based on highest confidence.
-    for model, extractions in extractions.groupby(attrgetter("model")).items():
+    # Downselect all labels from all tasks based on highest confidence.
+    for task, extractions in extractions.groupby(attrgetter("task")).items():
         for label, extractions in extractions.groupby(attrgetter("label")).items():
             # Order extractions by confidence descending.
             ordered = extractions.orderby(attrgetter("confidence"), reverse=True)
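
The hunk above groups extractions by task, then by label, and orders each group by confidence. What happens to `ordered` afterwards is outside the visible diff, so the sketch below assumes, based only on the "Downselect ... based on highest confidence" comment, that the top-confidence extraction in each group is kept. It uses plain dataclasses and the standard library rather than the toolkit's `groupby()` and `orderby()` methods; `Extraction` and `downselect` are hypothetical stand-ins introduced for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from operator import attrgetter
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Extraction:
    """Hypothetical stand-in with only the attributes the hunk references."""
    task: str
    label: str
    text: str
    confidence: float


def downselect(extractions: Iterable[Extraction]) -> List[Extraction]:
    """Keep the highest-confidence extraction per (task, label) pair."""
    groups: Dict[Tuple[str, str], List[Extraction]] = defaultdict(list)
    for extraction in extractions:
        groups[(extraction.task, extraction.label)].append(extraction)

    selected = []
    for group in groups.values():
        # Order by confidence descending, as in the hunk, then keep the first.
        ordered = sorted(group, key=attrgetter("confidence"), reverse=True)
        selected.append(ordered[0])
    return selected


sample = [
    Extraction("invoices", "total", "$10.00", 0.62),
    Extraction("invoices", "total", "$100.00", 0.97),
    Extraction("invoices", "vendor", "Acme", 0.88),
]
best = downselect(sample)
```

Grouping on the `(task, label)` pair in one pass is equivalent to the nested grouping in the hunk, because each inner group is defined by the same two attributes.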