Refseq annotation by alinakbase · Pull Request #71 · kbase/cdm-data-loader-utils

alinakbase · 2026-01-21T18:19:55Z

No description provided.

src/cdm_data_loader_utils/parsers/annotation_parse.py

src/cdm_data_loader_utils/parsers/uniprot.py

tests/parsers/test_annotation_parse.py

ialarmedalien · 2026-01-21T20:30:56Z

src/cdm_data_loader_utils/parsers/annotation_parse.py

+    if identifier.startswith("GCF_"):
+        return f"insdc.gcf:{identifier}"


Let's also add in the matching check for the GCA_ prefix:

    if identifier.startswith("GCA_"):
        return f"insdc.gca:{identifier}"
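One way to keep the two prefix checks side by side is a small lookup table. This is only a sketch: the function name `assembly_curie`, the `ASSEMBLY_PREFIXES` table, and the fall-through behaviour for unrecognised identifiers are assumptions for illustration, not the code in this PR.

```python
# Hypothetical helper; the name, the prefix table, and the fall-through
# behaviour are assumptions, not the PR's actual implementation.
ASSEMBLY_PREFIXES = {
    "GCF_": "insdc.gcf",  # from the diff under review
    "GCA_": "insdc.gca",  # the addition requested in this comment
}


def assembly_curie(identifier: str) -> str:
    """Prefix an assembly accession with its CURIE namespace."""
    for accession_prefix, curie_prefix in ASSEMBLY_PREFIXES.items():
        if identifier.startswith(accession_prefix):
            return f"{curie_prefix}:{identifier}"
    # Assumption: unrecognised identifiers are returned unchanged.
    return identifier
```

Adding a further accession type would then be a one-line change to the table rather than another `if` branch.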

ialarmedalien · 2026-01-22T21:58:35Z

tests/parsers/test_annotation_parse.py

@@ -0,0 +1,710 @@
+import json


Before you start doing any refactoring, can you add in an integration test that checks the results of parsing the JSON data into all 8 CDM tables? Let me know when you have done that so I can take a look.

tests/validation/assertions.py

ialarmedalien · 2026-01-28T21:50:29Z

tests/parsers/test_annotation_parse.py

+    expected_tables = [
+        "contig",
+        "contig_x_contigcollection",
+        "contigcollection_x_feature",
+        "contigcollection_x_protein",
+        "feature",
+        "feature_x_protein",
+        "identifier",
+        "name",
+    ]


You also need the protein table -- it looks like the parser is not capturing the protein information any more.
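A check along these lines would make a dropped table fail loudly. It is a minimal sketch: `missing_tables` is a hypothetical helper, and it assumes the test can obtain the names of the tables the parser produced as an iterable of strings.

```python
from collections.abc import Iterable

# Table names copied from the test under review, plus the "protein" table
# requested in the comment above.
EXPECTED_TABLES = [
    "contig",
    "contig_x_contigcollection",
    "contigcollection_x_feature",
    "contigcollection_x_protein",
    "feature",
    "feature_x_protein",
    "identifier",
    "name",
    "protein",
]


def missing_tables(
    produced: Iterable[str], expected: Iterable[str] = EXPECTED_TABLES
) -> list[str]:
    """Return the expected table names that were not produced, sorted."""
    return sorted(set(expected) - set(produced))
```

In a test, `assert missing_tables(produced_names) == []` reports exactly which tables are absent instead of a bare length mismatch.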

tests/parsers/test_annotation_parse.py

ialarmedalien · 2026-01-29T21:30:57Z

tests/parsers/refseq/api/test_annotation_report.py

+    # Load NCBI dataset from NCBI API
+    sample_api_response = test_data_dir / "refseq" / "annotation_report.json"
+    dataset = json.load(sample_api_response.open())
+
+    # Run parse function
+    parse_annotation_data(spark, [dataset], TEST_NS)


You need to load the annotation_report.parsed.json file here and use that to populate expected_df.
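The idea can be sketched without Spark. The helpers below are hypothetical, and the sketch assumes the parsed fixture holds a list of row dictionaries per table, which the PR does not confirm. In the real test the expected rows would be turned into expected_df and compared with the parser output using the project's assertDataFrameEqual helper.

```python
import json
from pathlib import Path
from typing import Any


def load_json(path: Path) -> Any:
    """Read a JSON fixture, closing the file handle afterwards."""
    with path.open() as fh:
        return json.load(fh)


def rows_match(actual: list[dict[str, Any]], expected: list[dict[str, Any]]) -> bool:
    """Compare two lists of row dicts, ignoring row order and key order."""

    def canonical(rows: list[dict[str, Any]]) -> list[str]:
        # Serialise each row with sorted keys so equal rows compare equal.
        return sorted(json.dumps(row, sort_keys=True) for row in rows)

    return canonical(actual) == canonical(expected)
```

Using a `with` block also avoids leaving the fixture file open, which `json.load(path.open())` does.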

codecov · 2026-01-29T21:35:03Z

Codecov Report

❌ Patch coverage is 37.91209% with 339 lines in your changes missing coverage. Please review.
✅ Project coverage is 50.56%. Comparing base (e6b3b2f) to head (cb97303).
⚠️ Report is 1 commit behind head on develop.

Files with missing lines	Patch %	Lines
...ader_utils/parsers/refseq/api/annotation_report.py	0.00%	273 Missing ⚠️
...ader_utils/parsers/refseq/api/update_annotation.py	75.64%	66 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##           develop      #71      +/-   ##
===========================================
- Coverage    52.69%   50.56%   -2.14%     
===========================================
  Files           63       66       +3     
  Lines         3241     3787     +546     
===========================================
+ Hits          1708     1915     +207     
- Misses        1533     1872     +339
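The percentages in the report can be reproduced from the hit and line counts in the table above. This is a small sketch of the arithmetic only; the observation that the project figures look truncated rather than rounded is inferred from these numbers, not taken from Codecov's documentation.

```python
import math


def coverage_pct(hits: int, lines: int) -> float:
    """Coverage as a percentage of covered lines."""
    return 100 * hits / lines


def truncate(value: float, places: int = 2) -> float:
    """Drop digits beyond `places` decimals without rounding up."""
    factor = 10**places
    return math.floor(value * factor) / factor


# Counts taken from the coverage diff table.
patch_hits, patch_lines = 207, 546  # the +Hits and +Lines deltas
patch = coverage_pct(patch_hits, patch_lines)
head = coverage_pct(1915, 3787)
base = coverage_pct(1708, 3241)

print(round(patch, 5))           # patch coverage
print(patch_lines - patch_hits)  # lines missing coverage in the patch
print(truncate(base), truncate(head))
```

The 339 missing lines are simply the 546 added lines minus the 207 newly covered ones.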

Files with missing lines	Coverage Δ
...rc/cdm_data_loader_utils/model/kbase_cdm_schema.py	`100.00% <100.00%> (ø)`
...ader_utils/parsers/refseq/api/update_annotation.py	`75.64% <75.64%> (ø)`
...ader_utils/parsers/refseq/api/annotation_report.py	`0.00% <0.00%> (ø)`

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 5fe75e8...cb97303.

tests/parsers/refseq/api/test_annotation_report.py

src/cdm_data_loader_utils/parsers/refseq/api/annotation_report.py

tests/parsers/refseq/api/test_annotation_report.py

+
+# from pyspark.testing import assertDataFrameEqual, assertSchemaEqual
+
+from tests.helpers import assertDataFrameEqual


src/cdm_data_loader_utils/parsers/refseq/api/update_annotation.py

src/cdm_data_loader_utils/parsers/refseq/api/annotation_report.py

+
+from delta import configure_spark_with_delta_pip
+from pyspark.sql import SparkSession
+from pyspark.sql.types import StructType, StructField, StringType


github-code-quality bot found potential problems Jan 21, 2026

src/cdm_data_loader_utils/parsers/annotation_parse.py (fixed)

src/cdm_data_loader_utils/parsers/uniprot.py (fixed)

github-code-quality bot found potential problems Jan 21, 2026

tests/parsers/test_annotation_parse.py (fixed)

ialarmedalien changed the base branch from develop to uniprot-refactor-v2 January 21, 2026 19:07

ialarmedalien reviewed Jan 21, 2026

ialarmedalien reviewed Jan 22, 2026

ialarmedalien force-pushed the refseq-annotation branch from 16aa4cf to eab27c9 January 28, 2026 01:13

ialarmedalien force-pushed the uniprot-refactor-v2 branch 2 times, most recently from 06c5508 to a84db46 January 28, 2026 17:21

ialarmedalien reviewed Jan 28, 2026

tests/validation/assertions.py (outdated, resolved)

ialarmedalien reviewed Jan 28, 2026

ialarmedalien reviewed Jan 29, 2026

tests/parsers/test_annotation_parse.py (outdated, resolved)

ialarmedalien force-pushed the refseq-annotation branch from 5d4a64c to 0301f1b January 29, 2026 21:21

ialarmedalien changed the base branch from uniprot-refactor-v2 to develop January 29, 2026 21:22

ialarmedalien reviewed Jan 29, 2026

github-code-quality bot found potential problems Jan 30, 2026

tests/parsers/refseq/api/test_annotation_report.py (fixed)

ialarmedalien force-pushed the refseq-annotation branch from 9b5e4d0 to 6039dc6 January 30, 2026 16:55

ialarmedalien and others added 13 commits February 4, 2026 18:50

First pass at RefSeq API annotation report endpoint parser

f36ac1c

Restoring deleted test

584cc66

Add extra schemas

352e184

convert pyproject.toml and uv.lock

7b04d84

change function

7efd599

add SOP document

348039d

run ruff format

9dbb64e

run ruff format

6002124

modify the identifier function in report and modify the teat report

1798345

reformat the annotation report

451a9b9

add test-annotation-report.py

51d0a12

change test report and run_tests.sh

c6f62a0

Fixing various files that did not need to be edited

fe8695b

ialarmedalien force-pushed the refseq-annotation branch from eba4b50 to fe8695b February 5, 2026 02:53

change contig collection

786420b

github-code-quality bot found potential problems Feb 6, 2026

src/cdm_data_loader_utils/parsers/refseq/api/annotation_report.py (fixed)

tests/parsers/refseq/api/test_annotation_report.py

# from pyspark.testing import assertDataFrameEqual, assertSchemaEqual

from tests.helpers import assertDataFrameEqual

alinakbase added 2 commits February 5, 2026 17:11

formatting

b1edda9

Add new annotation report

c3cd906

github-code-quality bot found potential problems Feb 7, 2026

src/cdm_data_loader_utils/parsers/refseq/api/update_annotation.py (fixed)

alinakbase added 3 commits February 10, 2026 14:39

update the annotation parser and test script

5fea7b7

update the ruff format

cb97303

reorganize the refseq annotation

d957ecc

github-code-quality bot found potential problems Feb 13, 2026

src/cdm_data_loader_utils/parsers/refseq/api/annotation_report.py

from delta import configure_spark_with_delta_pip

from pyspark.sql import SparkSession

from pyspark.sql.types import StructType, StructField, StringType

alinakbase wants to merge 19 commits into develop from refseq-annotation
