CodeFrame is a fast, multi-language source code analyzer powered by Tree-sitter. It extracts types, methods, fields, and method calls across various languages (including Java, JS/TS, Python, C#, and PHP), streaming results as JSONL. It ships as a single self-contained JAR.

dxworks/codeframe

CodeFrame - Multi-Language Code Parser

A Tree-sitter-based code parser that extracts structural information from source files across multiple programming languages.

Supported Languages

  • Java (.java)
  • JavaScript (.js)
  • TypeScript (.ts)
  • Python (.py)
  • C# (.cs)
  • PHP (.php)
  • Ruby (.rb)
  • Rust (.rs)
  • SQL (.sql)
  • COBOL (.cbl, .cob, .cpy)
  • Markdown (.md, .markdown, .mkd, .mkdn, .mdwn, .mdown)

Features

For each supported language, CodeFrame extracts:

  • Type Information

    • Class/Interface declarations
    • Base classes (extends)
    • Implemented interfaces
  • Method Information

    • Method/Function names
    • Parameters
    • Local variables
    • Method calls with object context
  • File-Level Elements

    • Module/file-level constants and variables
    • Top-level function calls (outside any class/function)
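As an illustration of the categories above, here is a small hypothetical Python input file, annotated with the kind of element each line represents. The file and its names are invented for this example, and the README does not show how CodeFrame names or groups these elements in its JSON output, so the comments only map lines to the feature list.

```python
import json

DEFAULT_GREETING = "Woof from "              # file-level constant


class Animal:                                # class declaration
    def speak(self) -> str:                  # method name: speak
        return "..."


class Dog(Animal):                           # base class (extends): Animal
    def __init__(self, name: str):           # parameter: name
        self.name = name

    def speak(self) -> str:
        greeting = DEFAULT_GREETING          # local variable: greeting
        # Method call with object context: upper() called on self.name
        return greeting + self.name.upper()


# Top-level function call, outside any class or function
print(json.dumps({"said": Dog("rex").speak()}))
```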

Usage

Build the project

./gradlew build

Run analysis (two arguments required)

CodeFrame requires two arguments: <input-path> and <output-file>.

# Gradle
./gradlew run --args="<input-path> <output-file>"

# Direct JAR
java -jar codeframe.jar <input-path> <output-file>

Examples:

# Analyze a single file, write to codeframe-out/analysis.jsonl
./gradlew run --args="src/main/java/org/example/MyClass.java codeframe-out/analysis.jsonl"

# Analyze an entire directory
./gradlew run --args="src/main/java codeframe-out/analysis.jsonl"

# Analyze the entire project
./gradlew run --args=". codeframe-out/analysis.jsonl"

# Run directly via java
java -jar codeframe.jar src codeframe-out/analysis.jsonl

Docker

# Build
docker build -t codeframe-dev .

# Run (mount your code at /src)
docker run --rm -it -v "$PWD:/workspace" -v "/path/to/code:/src:ro" -w /workspace codeframe-dev

# Inside container
./gradlew run --args="/src /workspace/.out/analysis.jsonl"

Output

The analysis results are written to the path you pass as the second argument (e.g., /workspace/.out/analysis.jsonl) in JSONL format (JSON Lines - one JSON object per line). Parent directories for the output file are created automatically, and .out/ is gitignored by default.

Ignore patterns (.ignore)

  • Location: project root .ignore (included in releases).
  • Default contents:
    **node_modules**
    **.git**
    **.Designer.cs**
    **.Designer.vb**
    
  • Syntax:
    • Blank lines and lines starting with # are ignored.
    • Globs supported: * (within a segment), ** (across segments).
    • Paths are matched against normalized project paths relative to the input root.
  • Examples:
    • **node_modules** → ignore anything under any node_modules folder.
    • **.Designer.cs → ignore files ending with .Designer.cs anywhere.
    • src/generated/** → ignore everything under src/generated/.

How it works:

  • CodeFrame loads .ignore at startup using dx-ignore and filters files before analysis.
  • If .ignore is missing, no files are excluded by ignore rules.
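The pattern rules above can be sketched in a few lines of Python. This is only an approximation of the documented syntax (comments, blank lines, `*` within a segment, `**` across segments), not the dx-ignore implementation, whose exact matching semantics may differ. The function names `glob_to_regex`, `parse_rules`, and `is_ignored` are made up for this example.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a pattern: `**` may cross `/`, `*` stays within one segment."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def parse_rules(text: str) -> list:
    """Skip blank lines and `#` comments; compile everything else."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            rules.append(glob_to_regex(line))
    return rules


def is_ignored(path: str, rules: list) -> bool:
    """Match a path (relative to the input root) against all rules."""
    normalized = path.replace("\\", "/")
    return any(rule.fullmatch(normalized) for rule in rules)


rules = parse_rules("""
# generated and vendored code
**node_modules**
**.Designer.cs
src/generated/**
""")
print(is_ignored("web/node_modules/lib/index.js", rules))
print(is_ignored("src/main/App.java", rules))
```

In this sketch a pattern must match the whole relative path, which is why `**` is needed at both ends of `**node_modules**`.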

Configuration (codeframe-config.yml)

CodeFrame supports optional configuration via a codeframe-config.yml file in the project root.

Available options:

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| maxFileLines | integer | 20000 | Maximum number of lines a file can have. Files exceeding this limit are skipped during analysis. |
| hideSqlTableColumns | boolean | false | When true, SQL analysis output omits table column definitions for CREATE/ALTER TABLE operations. |
| analyzers | map | all enabled | Enable/disable specific language analyzers. See Analyzer Configuration below. |

Analyzer Configuration:

You can selectively enable/disable analyzers using the analyzers map. All analyzers are enabled by default.

Available analyzer keys: java, javascript, typescript, python, csharp, php, sql, cobol, ruby, rust, markdown

Example configuration:

maxFileLines: 20000
hideSqlTableColumns: false
analyzers:
  java: true
  python: true
  sql: true

Behavior:

  • If codeframe-config.yml is missing, default values are used.
  • If the file exists but contains invalid YAML or missing/invalid values, defaults are applied silently.
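The fallback behavior can be pictured with the following Python sketch, which works on an already-parsed YAML value. It is not CodeFrame's code. It assumes that defaults are applied per option (the README does not say whether one invalid value resets the whole file), that `maxFileLines` must be a positive integer, and that unknown analyzer keys are dropped. `resolve_config` is a hypothetical name.

```python
from typing import Any

# Analyzer keys as listed in the README
ANALYZER_KEYS = ("java", "javascript", "typescript", "python", "csharp",
                 "php", "sql", "cobol", "ruby", "rust", "markdown")


def resolve_config(raw: Any) -> dict:
    """Merge a parsed config value over the documented defaults."""
    config = {
        "maxFileLines": 20000,
        "hideSqlTableColumns": False,
        "analyzers": {key: True for key in ANALYZER_KEYS},
    }
    if not isinstance(raw, dict):
        # Missing file or unparseable YAML: keep every default
        return config

    max_lines = raw.get("maxFileLines")
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(max_lines, int) and not isinstance(max_lines, bool) and max_lines > 0:
        config["maxFileLines"] = max_lines

    hide = raw.get("hideSqlTableColumns")
    if isinstance(hide, bool):
        config["hideSqlTableColumns"] = hide

    analyzers = raw.get("analyzers")
    if isinstance(analyzers, dict):
        for key, enabled in analyzers.items():
            # Assumption: unknown keys and non-boolean values are ignored
            if key in config["analyzers"] and isinstance(enabled, bool):
                config["analyzers"][key] = enabled
    return config


resolved = resolve_config({"maxFileLines": "lots", "analyzers": {"sql": False}})
print(resolved["maxFileLines"], resolved["analyzers"]["sql"])
```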

Output Format

Output is JSONL (one JSON object per line) for memory efficiency and streaming.

Each line has a kind field:

  • "run" - Start metadata (timestamp, input path, file count)
  • File analysis objects (one per file)
  • "error" - Parse errors (if any)
  • "done" - Completion metadata (duration, counts)
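Because every line is an independent JSON object, the output can be consumed one line at a time without loading the whole file. The Python sketch below tallies records by `kind`. In the sample data, only the `kind` field and the values "run", "error", and "done" come from this README; the "file" value and all other field names are placeholders invented for illustration.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_kinds(lines: Iterable[str]) -> Counter:
    """Count records by `kind`; records without the field are counted under None."""
    return Counter(record.get("kind") for record in iter_records(lines))


# Hypothetical sample lines (placeholder fields, not the real schema)
sample = [
    '{"kind": "run", "inputPath": "src"}',
    '{"kind": "file", "filePath": "src/A.java"}',
    '{"kind": "error", "filePath": "src/B.java"}',
    '{"kind": "done", "durationMs": 12}',
]
kinds = count_kinds(sample)
print(kinds["error"])
```

For a real run, pass an open file handle instead of the sample list, for example `count_kinds(open("codeframe-out/analysis.jsonl", encoding="utf-8"))`.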

Example outputs: see the approved test outputs in the repository for real analysis results.

SQL Analysis

SQL file analysis uses a hybrid JSqlParser + ANTLR approach to support multiple dialects (PostgreSQL, MySQL, T-SQL, PL/SQL) without configuration.

For complete documentation on SQL support, see SQL_SPEC.md.

COBOL Analysis

COBOL file analysis extracts structural information including:

  • Program identification and metadata
  • File control entries and data definitions
  • Data items and variables
  • Sections and paragraphs
  • Copy book statements
  • Embedded SQL/CICS/IMS detection

For complete documentation on COBOL support, see COBOL_SPEC.md.

Markdown Analysis

Markdown file analysis extracts document structure including:

  • Preamble and heading hierarchy
  • Block-level elements (paragraphs, code blocks, tables, lists, block quotes, thematic breaks, HTML blocks, images)
  • Line spans for extracted elements

For complete documentation on Markdown support, see MARKDOWN_SPEC.md.

Architecture

See ARCHITECTURE.md for details on core components and design decisions.

Contributing

See CONTRIBUTING.md for guidelines on adding new languages and analyzer conventions.

Requirements

  • Java 17+
  • Gradle 8.x
  • No native toolchain required (Tree-sitter natives are bundled via Maven artifacts)

License

This project uses Tree-sitter and its language grammars, which are licensed under MIT.

Limitations

General

  • Nested functions (e.g., arrow functions inside other functions, decorator wrappers) are NOT extracted as separate methods. Instead, their calls and local variables are captured in the parent function. This prevents duplicate function names and keeps related calls grouped under the function that contains them.
  • Parameter modifiers (e.g., final, ref, out) are not captured.
  • Constructor calls (new ...) are not captured in methodCalls.
  • Loop header variables are not added to localVariables.
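Two of the general limitations are easier to see on a concrete input. The Python file below is a hypothetical example written for this README; the comments restate the limitations above and do not show actual CodeFrame output.

```python
import functools


def logged(func):
    prefix = "calling "                      # local variable of `logged`

    @functools.wraps(func)
    def wrapper(*args, **kwargs):            # nested function: not extracted as its own method
        label = prefix + func.__name__       # captured among the locals of `logged`
        print(label)                         # captured among the calls of `logged`
        return func(*args, **kwargs)

    return wrapper


def total(values):
    result = 0                               # ordinary local variable
    for value in values:                     # loop header variable: not added to localVariables
        result += value
    return result


@logged
def add(a, b):
    return total([a, b])


print(add(2, 3))
```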

Language-Specific

  • JavaScript/TypeScript:
    • Destructured parameters emit leaf names only.
    • Dynamic imports are ignored.
    • Constructor functions appear as methods.
    • Object literal methods (e.g., const obj = { method() {} }) are not extracted; only the containing variable is captured as a field.
    • Functions inside IIFEs are extracted as top-level methods (their enclosing scope is not tracked).
  • C#: Events are not handled; see test samples for details.
  • Java: Local/anonymous classes are not extracted as separate types.
  • Python: Type aliases using the TypeAlias annotation are captured with kind: "type_alias". PEP 695 style (type X = ...) is not yet supported by the tree-sitter-python grammar.
  • SQL: See SQL_SPEC.md
  • COBOL: See COBOL_SPEC.md
  • Markdown: Front matter is ignored in output; links and inline formatting are not emitted as dedicated elements. See MARKDOWN_SPEC.md

Testing

./gradlew test

See CONTRIBUTING.md for testing workflow and conventions.
