Compare commits
129 Commits
| SHA1 |
|---|
| a329453873 |
| 42db737ae6 |
| 29f6f8960d |
| 6a67017962 |
| d4a32094a7 |
| 3eceedd96b |
| 8aad53e079 |
| df03525fd7 |
| 8a000edb7b |
| 375ad4a4cb |
| 59df576c85 |
| d4cef5135f |
| 25142c34f1 |
| 0cba17d9ce |
| e0cbf2c99f |
| 4e2740ada0 |
| 509cb75dfa |
| 5673adecff |
| 5cda58e8fc |
| c35b0e9f53 |
| f023c6741e |
| be9e32b439 |
| cd9e4146e0 |
| 3e9c6c00b8 |
| 0e5c5fd706 |
| 4d3714bb4b |
| 3296077461 |
| a0942db712 |
| cee83611b5 |
| e1046ba3fa |
| bbc8ecf88b |
| 7438dee7ac |
| be922cc952 |
| bfb1810fb0 |
| 838febf48c |
| 69f53d7256 |
| 6793142557 |
| 25f6cddf72 |
| cd519d8e99 |
| 2ae59776fa |
| 529fba4db6 |
| d8659cd3a9 |
| 18dc6497ae |
| c849728ee7 |
| a1c2b5b1ef |
| 6fbd285658 |
| 9f4058c64b |
| 236489ba2a |
| 2920b753a8 |
| dbbec511ef |
| 29e626c984 |
| 848726e6e2 |
| e66f237cfd |
| f509b8c94a |
| 2ac159bfa2 |
| 226c1e1f76 |
| a24ccd53ae |
| d3451d0ce7 |
| e2bf22dac8 |
| 5c9cff2125 |
| cc602f5a82 |
| 19827f1837 |
| 0a51bde1c5 |
| 249a4d5a5f |
| 720795e478 |
| fac1cfdcda |
| 82f3eb67b7 |
| 30fbc6dc2d |
| b869122dc6 |
| eaed8d88ca |
| 5d95d2e65c |
| ad84fcfce8 |
| ec90b2fdb9 |
| ff1872d067 |
| ef529f9234 |
| 3c0ec95b26 |
| f3148ef32c |
| 3b18667541 |
| 91efec1bfa |
| 6aa4c49d33 |
| 7b2b947c47 |
| a3b85c0d3d |
| 683e53c78d |
| cb02bc4313 |
| 9e2d4347b1 |
| 55823256ec |
| 58e565e2da |
| 0de36b5523 |
| 7c50ae8595 |
| dc57872bc9 |
| 1997145cf6 |
| 5f62c7fbf0 |
| 9b7e392c66 |
| eac7f455d6 |
| f84168a00b |
| 3746ddd427 |
| d5eca45fcc |
| 5a9138026f |
| 891a9c310a |
| 7a8491b595 |
| fe273c0258 |
| b4b9e8cfc4 |
| 8e657bdc25 |
| d80d7262ef |
| 7d237d820c |
| 468f5c7a66 |
| 936744ea13 |
| 574dcc2842 |
| 5c58b88c96 |
| aaee57acd2 |
| 7e163928c7 |
| 8fdaef43f9 |
| 35bc4e71a1 |
| 09f02166db |
| 6179cc3226 |
| 8d1e719455 |
| dd00e11a98 |
| 4d494520f8 |
| e99ca14d59 |
| af34ac5552 |
| 34ac31504a |
| 5b730329b4 |
| d8ee5e3b11 |
| 17c92cce37 |
| bf220c1f7f |
| 5b66e42df7 |
| 979301a768 |
| b64e119093 |
| c66b55638f |
@@ -1,164 +0,0 @@
---
name: commit-crafter
description: Expertly creates clean, conventional, and atomic Git commits with pre-commit checks.
---

You are an expert Git assistant. Your purpose is to help create perfectly formatted, atomic commits that follow conventional commit standards. You enforce code quality by running pre-commit checks (if they exist) and help maintain a clean project history by splitting large changes into logical units.

## Using Hints for Commit Customization

When a user provides a hint, use it to guide the commit message generation while still maintaining conventional commit standards:

- **Analyze the hint**: Extract the key intent, context, or focus area from the user's hint
- **Combine with code analysis**: Use both the hint and the actual code changes to determine the most appropriate commit type and description
- **Prioritize hint context**: When the hint provides specific context (e.g., "fix login bug"), use it to craft a more targeted and meaningful commit message
- **Maintain standards**: The hint should guide the message content, but the format must still follow conventional commit standards
- **Resolve conflicts**: If the hint conflicts with what the code changes suggest, prioritize the code changes but incorporate the hint's context where applicable

## Best Practices for Commits

- **Verify before committing**: Ensure code is linted, builds correctly, and documentation is updated
- **Use hints effectively**: When a hint is provided, incorporate its context into the commit message while ensuring the message accurately reflects the actual code changes
- **Atomic commits**: Each commit should contain related changes that serve a single purpose
- **Split large changes**: If changes touch multiple concerns, split them into separate commits
- **Conventional commit format**: Use the format `[<type>] <description>`, where common values of `<type>` include:
  - feat: A new feature
  - fix: A bug fix
  - docs: Documentation changes
  - style: Code style changes (formatting, etc.)
  - refactor: Code changes that neither fix bugs nor add features
  - perf: Performance improvements
  - test: Adding or fixing tests
  - chore: Changes to the build process, tools, etc.
- **Present tense, imperative mood**: Write commit messages as commands (e.g., "add feature", not "added feature")
- **Concise first line**: Keep the first line under 72 characters
- **Emoji**: Each commit type is paired with an appropriate emoji:
  - ✨ [feat] New feature
  - 🐛 [fix] Bug fix
  - 📝 [docs] Documentation
  - 💄 [style] Formatting/style
  - ♻️ [refactor] Code refactoring
  - ⚡️ [perf] Performance improvements
  - ✅ [test] Tests
  - 🔧 [chore] Tooling, configuration
  - 🚀 [ci] CI/CD improvements
  - 🗑️ [revert] Reverting changes
  - 🧪 [test] Add a failing test
  - 🚨 [fix] Fix compiler/linter warnings
  - 🔒️ [fix] Fix security issues
  - 👥 [chore] Add or update contributors
  - 🚚 [refactor] Move or rename resources
  - 🏗️ [refactor] Make architectural changes
  - 🔀 [chore] Merge branches
  - 📦️ [chore] Add or update compiled files or packages
  - ➕ [chore] Add a dependency
  - ➖ [chore] Remove a dependency
  - 🌱 [chore] Add or update seed files
  - 🧑 [chore] Improve developer experience
  - 🧵 [feat] Add or update code related to multithreading or concurrency
  - 🔍️ [feat] Improve SEO
  - 🏷️ [feat] Add or update types
  - 💬 [feat] Add or update text and literals
  - 🌐 [feat] Internationalization and localization
  - 👔 [feat] Add or update business logic
  - 📱 [feat] Work on responsive design
  - 🚸 [feat] Improve user experience / usability
  - 🩹 [fix] Simple fix for a non-critical issue
  - 🥅 [fix] Catch errors
  - 👽️ [fix] Update code due to external API changes
  - 🔥 [fix] Remove code or files
  - 🎨 [style] Improve structure/format of the code
  - 🚑️ [fix] Critical hotfix
  - 🎉 [chore] Begin a project
  - 🔖 [chore] Release/version tags
  - 🚧 [wip] Work in progress
  - 💚 [fix] Fix CI build
  - 📌 [chore] Pin dependencies to specific versions
  - 👷 [ci] Add or update CI build system
  - 📈 [feat] Add or update analytics or tracking code
  - ✏️ [fix] Fix typos
  - ⏪️ [revert] Revert changes
  - 📄 [chore] Add or update license
  - 💥 [feat] Introduce breaking changes
  - 🍱 [assets] Add or update assets
  - ♿️ [feat] Improve accessibility
  - 💡 [docs] Add or update comments in source code
  - 🗃️ [db] Perform database-related changes
  - 🔊 [feat] Add or update logs
  - 🔇 [fix] Remove logs
  - 🤡 [test] Mock things
  - 🥚 [feat] Add or update an easter egg
  - 🙈 [chore] Add or update .gitignore file
  - 📸 [test] Add or update snapshots
  - ⚗️ [experiment] Perform experiments
  - 🚩 [feat] Add, update, or remove feature flags
  - 💫 [ui] Add or update animations and transitions
  - ⚰️ [refactor] Remove dead code
  - 🦺 [feat] Add or update code related to validation
  - ✈️ [feat] Improve offline support
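As a quick illustration of the `[<type>] <description>` format above, here is a minimal sketch of crafting a single atomic, emoji-prefixed commit with an imperative subject line. It runs entirely inside a throwaway repository; the file name and identity settings are illustrative, not from any real project:

```shell
set -e
# Work in a throwaway repository so nothing real is touched
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.name "Demo"
git config user.email "demo@example.com"

# One logical change per commit: add a feature file, stage it, commit it
echo "def login(): pass" > auth.py
git add auth.py
git commit -q -m "✨ [feat] Add user authentication stub"

# The subject line is imperative and stays under 72 characters
git log -1 --pretty=%s
```

The same pattern applies to any type in the table: pick the emoji and `<type>` that match the dominant intent of the staged change, and keep the description to one imperative clause.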
## Guidelines for Splitting Commits

When analyzing the diff, consider splitting commits based on these criteria:

1. **Different concerns**: Changes to unrelated parts of the codebase
2. **Different types of changes**: Mixing features, fixes, refactoring, etc.
3. **File patterns**: Changes to different types of files (e.g., source code vs. documentation)
4. **Logical grouping**: Changes that would be easier to understand or review separately
5. **Size**: Very large changes that would be clearer if broken down
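The criteria above can be sketched mechanically: a mixed working tree (a source fix plus a documentation edit) is split into two atomic commits by staging and committing each concern separately. This runs in a throwaway repository, and the file names are hypothetical:

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.name "Demo"
git config user.email "demo@example.com"

# A mixed working tree: one code fix and one docs change
echo "print('fixed')" > parser.py
echo "# Usage" > README.md

# Stage and commit each concern on its own
git add parser.py
git commit -q -m "🐛 [fix] Resolve crash in parser"
git add README.md
git commit -q -m "📝 [docs] Add usage section to README"

git log --oneline
```

For changes tangled inside a single file, `git add -p` serves the same purpose at hunk granularity.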
## Examples

Good commit messages:

- ✨ [feat] Add user authentication system
- 🐛 [fix] Resolve memory leak in rendering process
- 📝 [docs] Update API documentation with new endpoints
- ♻️ [refactor] Simplify error handling logic in parser
- 🚨 [fix] Resolve linter warnings in component files
- 🧑 [chore] Improve developer tooling setup process
- 👔 [feat] Implement business logic for transaction validation
- 🩹 [fix] Address minor styling inconsistency in header
- 🚑️ [fix] Patch critical security vulnerability in auth flow
- 🎨 [style] Reorganize component structure for better readability
- 🔥 [fix] Remove deprecated legacy code
- 🦺 [feat] Add input validation for user registration form
- 💚 [fix] Resolve failing CI pipeline tests
- 📈 [feat] Implement analytics tracking for user engagement
- 🔒️ [fix] Strengthen authentication password requirements
- ♿️ [feat] Improve form accessibility for screen readers

Examples with hints:

**Hint: "fix user login bug"**

- Code changes: Fix null pointer exception in auth service
- Generated: 🐛 [fix] Resolve null pointer exception in user login flow

**Hint: "API refactoring"**

- Code changes: Extract common validation logic into separate service
- Generated: ♻️ [refactor] Extract API validation logic into shared service

**Hint: "add dark mode support"**

- Code changes: Add CSS variables and theme toggle component
- Generated: ✨ [feat] Implement dark mode support with theme toggle

**Hint: "performance optimization"**

- Code changes: Implement memoization for expensive calculations
- Generated: ⚡️ [perf] Add memoization to optimize calculation performance

Example of splitting commits:

- First commit: ✨ [feat] Add new solc version type definitions
- Second commit: 📝 [docs] Update documentation for new solc versions
- Third commit: 🔧 [chore] Update package.json dependencies
- Fourth commit: 🏷️ [feat] Add type definitions for new API endpoints
- Fifth commit: 🧵 [feat] Improve concurrency handling in worker threads
- Sixth commit: 🚨 [fix] Resolve linting issues in new code
- Seventh commit: ✅ [test] Add unit tests for new solc version features
- Eighth commit: 🔒️ [fix] Update dependencies with security vulnerabilities

## Important Notes

- **If no files are staged, abort the process immediately.**
- **Commit staged files only**: Unstaged files are assumed to be intentionally excluded from the current commit.
- **Do not run pre-commit checks yourself**. If a pre-commit hook is triggered and fails during the commit, abort the process immediately.
- **Process hints carefully**: When a hint is provided, analyze it to understand the user's intent, but always verify it aligns with the actual code changes.
- **Hint priority**: Use hints to provide context and focus, but the actual code changes should determine the commit type and scope.
- Before committing, review the diff to **identify whether multiple commits would be more appropriate**.
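The "staged files only / abort if nothing is staged" rules above reduce to one check: `git diff --cached --name-only` lists exactly what would be committed, so an empty result means abort. A minimal sketch in a throwaway repository (file names illustrative):

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.name "Demo"
git config user.email "demo@example.com"

echo "a" > staged.txt
echo "b" > unstaged.txt
git add staged.txt   # only this file is in the index

# Abort when the index is empty; otherwise commit only what is staged
if [ -z "$(git diff --cached --name-only)" ]; then
  echo "Nothing staged; aborting." >&2
  exit 1
fi
git commit -q -m "🔧 [chore] Commit staged files only"

# The commit contains staged.txt but not unstaged.txt
git show --name-only --pretty=%s HEAD
```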
@@ -1,71 +0,0 @@
---
name: staged-code-reviewer
description: Reviews staged git changes for quality, security, and performance. Analyzes files in the git index (git diff --cached) and provides actionable, line-by-line feedback.
---

You are a specialized code review agent. Your sole function is to analyze git changes that have been staged for commit. You must ignore unstaged changes, untracked files, and non-code files (e.g., binaries, data). Your review should be direct, objective, and focused on providing actionable improvements.

## Core Directives

1. Analyze Staged Code: Use the output of `git diff --cached` as the exclusive source for your review.
2. Prioritize by Impact: Focus first on security vulnerabilities and critical bugs, then on performance, and finally on code quality and style.
3. Provide Actionable Feedback: Every identified issue must be accompanied by a concrete suggestion for improvement.

## Review Criteria

For each change, evaluate the following:

* Security: Check for hardcoded secrets, injection vulnerabilities (SQL, XSS), insecure direct object references, and missing authentication/authorization.
* Correctness & Reliability: Verify the logic works as intended, includes proper error handling, and considers edge cases.
* Performance: Identify inefficient algorithms, potential bottlenecks, and expensive operations (e.g., N+1 database queries).
* Code Quality: Assess readability, simplicity, naming conventions, and code duplication (DRY principle).
* Test Coverage: Ensure that new logic is accompanied by meaningful tests.

## Critical Issues to Flag Immediately

* Hardcoded credentials, API keys, or tokens.
* SQL or command injection vulnerabilities.
* Cross-Site Scripting (XSS) vulnerabilities.
* Missing or incorrect authentication/authorization checks.
* Use of unsafe functions like eval() without proper sanitization.

## Output Format

Your entire response must follow this structure. Do not deviate.

Start with a summary header:

Staged Code Review
---
Files Reviewed: [List of staged files]
Total Changes: [Number of lines added/removed]

---

Then, for each file with issues, create a section:

### filename.ext

(One-line summary of the changes in this file.)

**CRITICAL ISSUES**
* (Line X): [Concise Issue Title]
  Problem: [Clear description of the issue.]
  Suggestion: [Specific, actionable improvement.]
  Reasoning: [Why the change is necessary (e.g., security, performance).]

**MAJOR ISSUES**
* (Line Y): [Concise Issue Title]
  Problem: [Clear description of the issue.]
  Suggestion: [Specific, actionable improvement, including code examples if helpful.]
  Reasoning: [Why the change is necessary.]

**MINOR ISSUES**
* (Line Z): [Concise Issue Title]
  Problem: [Clear description of the issue.]
  Suggestion: [Specific, actionable improvement.]
  Reasoning: [Why the change is necessary.]

If a file has no issues, state: "No issues found."

If you see well-implemented code, you may optionally add a "Positive Feedback" section to acknowledge it.
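The summary header above needs the list of staged files and the line counts, and both come straight from the index. A sketch using `git diff --cached` in a throwaway repository (file name illustrative): `--name-only` yields "Files Reviewed" and `--numstat` yields the added/removed counts for "Total Changes":

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.name "Demo"
git config user.email "demo@example.com"

printf 'line1\nline2\n' > app.py
git add app.py

# Files Reviewed: names of staged files
git diff --cached --name-only

# Total Changes: lines added / removed per staged file
git diff --cached --numstat
```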
@@ -1 +0,0 @@
Use the staged-code-reviewer subagent to perform code review
@@ -1,13 +0,0 @@
Please analyze and fix the GitHub issue: $ARGUMENTS.

Follow these steps:

1. Use `gh issue view` to get the issue details
2. Understand the problem described in the issue
3. Search the codebase for relevant files
4. Implement the necessary changes to fix the issue
5. Write and run tests to verify the fix
6. Ensure code passes linting and type checking
7. Create a descriptive commit message

Remember to use the GitHub CLI (`gh`) for all GitHub-related tasks.
@@ -1,16 +0,0 @@
Use the commit-crafter subagent to make a standardized commit

## Usage

```
/make-commit [hint]
```

**Parameters:**

- `hint` (optional): A brief description or context to help customize the commit message. The hint will be used to guide the commit message generation while maintaining conventional commit standards.

**Examples:**

- `/make-commit` - Generate a commit message based purely on code changes
- `/make-commit "API refactoring"` - Guide the commit to focus on API-related changes
- `/make-commit "fix user login bug"` - Provide context about the specific issue being fixed
- `/make-commit "add dark mode support"` - Indicate the feature being added
19 .github/workflows/deploy-doc.yml (vendored)
```diff
@@ -11,13 +11,18 @@ jobs:
       - uses: actions/checkout@v4
         with:
           persist-credentials: false
-      - name: Build HTML
-        uses: ammaraskar/sphinx-action@7.0.0
+      - name: Set up Python
+        uses: actions/setup-python@v4
         with:
-          pre-build-command: |
-            apt-get update && apt-get install -y git
-            pip install uv
-            uv pip install --system . .[docs]
+          python-version: '3.10'
+      - name: Install uv
+        run: pip install uv
+      - name: Install docs dependencies
+        run: uv pip install --system -e ".[docs]"
+      - name: Build HTML
+        run: |
+          cd docs
+          make html
       - name: Upload artifacts
         uses: actions/upload-artifact@v4
         with:
@@ -28,4 +33,4 @@ jobs:
       if: github.ref == 'refs/heads/main'
       with:
         github_token: ${{ secrets.GITHUB_TOKEN }}
-        publish_dir: docs/build/html
+        publish_dir: docs/build/html/
```
4 .github/workflows/pr-welcome.yml (vendored)
```diff
@@ -4,10 +4,6 @@ on:
   pull_request:
     types: [opened]
 
-permissions:
-  pull-requests: write
-  issues: write
-
 jobs:
   welcome:
     runs-on: ubuntu-latest
```
2 .github/workflows/publish.yml (vendored)
```diff
@@ -2,6 +2,8 @@ name: Publish to PyPI
 
 on:
   push:
     branches:
       - 'main'
+    tags:
+      - 'v*'
 
```
6 .github/workflows/python-lint.yml (vendored)
```diff
@@ -21,7 +21,7 @@ jobs:
       - name: Install dependencies
         run: |
           python -m pip install --upgrade pip
-          pip install ruff
+          pip install pre-commit
 
-      - name: Run ruff
-        run: ruff check .
+      - name: Run pre-commit
+        run: pre-commit run --all-files
```
4 .github/workflows/test.yaml (vendored)
```diff
@@ -28,8 +28,8 @@ jobs:
 
       - name: Install dependencies
         run: |
-          uv sync --extra test
+          uv sync --group test
 
       - name: Run tests with pytest
         run: |
-          uv run pytest -v tests/
+          uv run pytest tests/
```
```diff
@@ -17,7 +17,6 @@ repos:
       - id: check-yaml
       - id: check-toml
       - id: check-added-large-files
-        exclude: assets/
      - id: check-case-conflict
      - id: check-merge-conflict
      - id: debug-statements
```
37 README.md
```diff
@@ -2,16 +2,15 @@
 
 <div align="center">
   <h1>
-    <img src="./assets/fire.svg" width=60, height=60>
+    <img src="./assets/fire.svg" width=30, height=30>
     𝚃𝚎𝚡𝚃𝚎𝚕𝚕𝚎𝚛
-    <img src="./assets/fire.svg" width=60, height=60>
+    <img src="./assets/fire.svg" width=30, height=30>
   </h1>
 
+[](https://oleehyo.github.io/TexTeller/)
+[](https://arxiv.org/abs/2508.09220)
-[](https://huggingface.co/datasets/OleehyO/latex-formulas)
+[](https://huggingface.co/datasets/OleehyO/latex-formulas-80M)
 [](https://huggingface.co/OleehyO/TexTeller)
+[](https://hub.docker.com/r/oleehyo/texteller)
 [](https://opensource.org/licenses/Apache-2.0)
 
 </div>
```
````diff
@@ -57,43 +56,37 @@ TexTeller was trained with **80M image-formula pairs** (previous dataset can be
   </tr>
 </table>
 
-## 📮 Change Log
+## 🔄 Change Log
 
-- [2024-06-06] **TexTeller3.0 released!** The training data has been increased to **80M** (**10x more than** TexTeller2.0 and also improved in data diversity). TexTeller3.0's new features:
+- 📮[2024-06-06] **TexTeller3.0 released!** The training data has been increased to **80M** (**10x more than** TexTeller2.0 and also improved in data diversity). TexTeller3.0's new features:
 
   - Support scanned image, handwritten formulas, English(Chinese) mixed formulas.
 
   - OCR abilities in both Chinese and English for printed images.
 
-- [2024-05-02] Support **paragraph recognition**.
+- 📮[2024-05-02] Support **paragraph recognition**.
 
-- [2024-04-12] **Formula detection model** released!
+- 📮[2024-04-12] **Formula detection model** released!
 
-- [2024-03-25] TexTeller2.0 released! The training data for TexTeller2.0 has been increased to 7.5M (15x more than TexTeller1.0 and also improved in data quality). The trained TexTeller2.0 demonstrated **superior performance** in the test set, especially in recognizing rare symbols, complex multi-line formulas, and matrices.
+- 📮[2024-03-25] TexTeller2.0 released! The training data for TexTeller2.0 has been increased to 7.5M (15x more than TexTeller1.0 and also improved in data quality). The trained TexTeller2.0 demonstrated **superior performance** in the test set, especially in recognizing rare symbols, complex multi-line formulas, and matrices.
 
 > [Here](./assets/test.pdf) are more test images and a horizontal comparison of various recognition models.
 
 ## 🚀 Getting Started
 
-1. Install uv:
+1. Install the project's dependencies:
 
    ```bash
-   pip install uv
+   pip install texteller
    ```
 
-2. Install the project's dependencies:
+2. If you are using the CUDA backend, you may need to install `onnxruntime-gpu`:
 
    ```bash
-   uv pip install texteller
+   pip install texteller[onnxruntime-gpu]
    ```
 
-3. If you are using the CUDA backend, you may need to install `onnxruntime-gpu`:
-
-   ```bash
-   uv pip install texteller[onnxruntime-gpu]
-   ```
-
-4. Run the following command to start inference:
+3. Run the following command to start inference:
 
    ```bash
    texteller inference "/path/to/image.{jpg,png}"
@@ -171,7 +164,7 @@ Please setup your environment before training:
 1. Install the dependencies for training:
 
    ```bash
-   uv pip install texteller[train]
+   pip install texteller[train]
    ```
 
 2. Clone the repository:
````
```diff
@@ -1,17 +1,16 @@
-📄 中文 | [English](../README.md)
+📄 中文 | [English](./README.md)
 
 <div align="center">
   <h1>
-    <img src="./fire.svg" width=60, height=60>
+    <img src="./fire.svg" width=30, height=30>
     𝚃𝚎𝚡𝚃𝚎𝚕𝚕𝚎𝚛
-    <img src="./fire.svg" width=60, height=60>
+    <img src="./fire.svg" width=30, height=30>
   </h1>
 
+[](https://oleehyo.github.io/TexTeller/)
+[](https://arxiv.org/abs/2508.09220)
+[](https://hub.docker.com/r/oleehyo/texteller)
-[](https://huggingface.co/datasets/OleehyO/latex-formulas)
+[](https://huggingface.co/datasets/OleehyO/latex-formulas-80M)
 [](https://huggingface.co/OleehyO/TexTeller)
 [](https://opensource.org/licenses/Apache-2.0)
 
 </div>
```
````diff
@@ -71,29 +70,23 @@ TexTeller was trained with **80M image-formula pairs** (the previous dataset
 
 - [2024-03-25] TexTeller2.0 released! The training data for TexTeller2.0 was increased to 7.5M (15x more than the previous generation, with improved data quality). TexTeller2.0 demonstrated **superior performance** on the test set, especially in recognizing rare symbols, complex multi-line formulas, and matrices.
 
-> [Here](./test.pdf) are more test images and a horizontal comparison of various recognition models.
+> [Here](./assets/test.pdf) are more test images and a horizontal comparison of various recognition models.
 
 ## 🚀 Quick Start
 
-1. Install uv:
+1. Install the project's dependencies:
 
    ```bash
-   pip install uv
+   pip install texteller
    ```
 
-2. Install the project's dependencies:
+2. If you are using the CUDA backend, you may need to install `onnxruntime-gpu`:
 
    ```bash
-   uv pip install texteller
+   pip install texteller[onnxruntime-gpu]
    ```
 
-3. If you are using the CUDA backend, you may need to install `onnxruntime-gpu`:
-
-   ```bash
-   uv pip install texteller[onnxruntime-gpu]
-   ```
-
-4. Run the following command to start inference:
+3. Run the following command to start inference:
 
    ```bash
    texteller inference "/path/to/image.{jpg,png}"
````
````diff
@@ -103,7 +96,7 @@ TexTeller was trained with **80M image-formula pairs** (the previous dataset
 
 ## 🌐 Web Demo
 
-Run from the command line:
+Run the command:
 
 ```bash
 texteller web
````
```diff
@@ -159,7 +152,7 @@ print(response.text)
 TexTeller's formula detection model was trained on 3,415 images of Chinese documents and 8,272 images from the [IBEM dataset](https://zenodo.org/records/4757865).
 
 <div align="center">
-  <img src="./det_rec.png" width=250>
+  <img src="./assets/det_rec.png" width=250>
 </div>
 
 A formula detection interface is provided in the Python API; see the [API documentation](https://oleehyo.github.io/TexTeller/) for details.
```
````diff
@@ -171,7 +164,7 @@ TexTeller's formula detection model was trained on 3,415 Chinese document images
 1. Install the training dependencies:
 
    ```bash
-   uv pip install texteller[train]
+   pip install texteller[train]
    ```
 
 2. Clone the repository:
@@ -192,7 +185,7 @@ TexTeller's formula detection model was trained on 3,415 Chinese document images
 accelerate launch train.py
 ```
 
-Training parameters can be adjusted via [`train_config.yaml`](../examples/train_texteller/train_config.yaml).
+Training parameters can be adjusted via [`train_config.yaml`](./examples/train_texteller/train_config.yaml).
 
 ## 📅 Roadmap
````
Binary file not shown.
Binary file not shown.
```diff
@@ -457,4 +457,4 @@
   <animate attributeName="cy" values="114.80243604193255;7.19374553530416" keyTimes="0;1" dur="1s" repeatCount="indefinite" begin="-0.6866227460985781s"></animate>
   <animate attributeName="r" values="9;0;0" keyTimes="0;0.6690048284116141;1" dur="1s" repeatCount="indefinite" begin="-0.6866227460985781s"></animate>
   </circle></g>
-</svg>
+</svg>
```

Before: 58 KiB · After: 58 KiB
```diff
@@ -1,10 +1,9 @@
 
-<svg xmlns="http://www.w3.org/2000/svg" width="430" height="80" viewBox="0 0 430 80">
+<svg xmlns="http://www.w3.org/2000/svg" width="354" height="100" viewBox="0 0 354 100">
 
   <text
     x="50%"
     y="50%"
-    font-family="monaco"
+    font-family="Arial, sans-serif"
     font-size="55"
     text-anchor="middle"
     dominant-baseline="middle">
```

Before: 377 B · After: 389 B
```diff
@@ -12,64 +12,64 @@
 import os
 import sys
 
-sys.path.insert(0, os.path.abspath("../.."))
+sys.path.insert(0, os.path.abspath('../..'))
 
 # -- Project information -----------------------------------------------------
 
-project = "TexTeller"
-copyright = "2025, TexTeller Team"
-author = "TexTeller Team"
+project = 'TexTeller'
+copyright = '2025, TexTeller Team'
+author = 'TexTeller Team'
 
 # -- General configuration ---------------------------------------------------
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
 
 extensions = [
-    "myst_parser",
-    "sphinx.ext.duration",
-    "sphinx.ext.intersphinx",
-    "sphinx.ext.autosectionlabel",
-    "sphinx.ext.autodoc",
-    "sphinx.ext.viewcode",
-    "sphinx.ext.napoleon",
-    "sphinx.ext.autosummary",
-    "sphinx_copybutton",
+    'myst_parser',
+    'sphinx.ext.duration',
+    'sphinx.ext.intersphinx',
+    'sphinx.ext.autosectionlabel',
+    'sphinx.ext.autodoc',
+    'sphinx.ext.viewcode',
+    'sphinx.ext.napoleon',
+    'sphinx.ext.autosummary',
+    'sphinx_copybutton',
     # 'sphinx.ext.linkcode',
     # 'sphinxarg.ext',
-    "sphinx_design",
-    "nbsphinx",
+    'sphinx_design',
+    'nbsphinx',
 ]
 
-templates_path = ["_templates"]
+templates_path = ['_templates']
 exclude_patterns = []
 
 # Autodoc settings
-autodoc_member_order = "bysource"
+autodoc_member_order = 'bysource'
 add_module_names = False
-autoclass_content = "both"
+autoclass_content = 'both'
 autodoc_default_options = {
-    "members": True,
-    "member-order": "bysource",
-    "undoc-members": True,
-    "show-inheritance": True,
-    "imported-members": True,
+    'members': True,
+    'member-order': 'bysource',
+    'undoc-members': True,
+    'show-inheritance': True,
+    'imported-members': True,
 }
 
 # Intersphinx settings
 intersphinx_mapping = {
-    "python": ("https://docs.python.org/3", None),
-    "numpy": ("https://numpy.org/doc/stable", None),
-    "torch": ("https://pytorch.org/docs/stable", None),
-    "transformers": ("https://huggingface.co/docs/transformers/main/en", None),
+    'python': ('https://docs.python.org/3', None),
+    'numpy': ('https://numpy.org/doc/stable', None),
+    'torch': ('https://pytorch.org/docs/stable', None),
+    'transformers': ('https://huggingface.co/docs/transformers/main/en', None),
 }
 
-html_theme = "sphinx_book_theme"
+html_theme = 'sphinx_book_theme'
 
 html_theme_options = {
-    "repository_url": "https://github.com/OleehyO/TexTeller",
-    "use_repository_button": True,
-    "use_issues_button": True,
-    "use_edit_page_button": True,
-    "use_download_button": True,
+    'repository_url': 'https://github.com/OleehyO/TexTeller',
+    'use_repository_button': True,
+    'use_issues_button': True,
+    'use_edit_page_button': True,
+    'use_download_button': True,
 }
 
 html_logo = "../../assets/logo.svg"
```
```diff
@@ -20,8 +20,7 @@ You can install TexTeller using pip:
 
 .. code-block:: bash
 
-   pip install uv
-   uv pip install texteller
+   pip install texteller
 
 Quick Start
 ----------
@@ -41,7 +40,7 @@ Converting an image to LaTeX:
 
 Processing a mixed text/formula image:
 
-.. code-block:: python
+.. code-block::python
 
     from texteller import (
         load_model, load_tokenizer, load_latexdet_model,
```
```diff
@@ -3,8 +3,8 @@ import requests
 server_url = "http://127.0.0.1:8000/predict"
 
 img_path = "/path/to/your/image"
-with open(img_path, "rb") as img:
-    files = {"img": img}
+with open(img_path, 'rb') as img:
+    files = {'img': img}
     response = requests.post(server_url, files=files)
 
 print(response.text)
```
```diff
@@ -22,7 +22,7 @@ dependencies = [
     "streamlit-paste-button>=0.1.2",
     "torch>=2.6.0",
     "torchvision>=0.21.0",
-    "transformers==4.47",
+    "transformers==4.45.2",
     "wget>=3.2",
     "optimum[onnxruntime]>=1.24.0",
     "python-multipart>=0.0.20",
@@ -44,6 +44,7 @@ quote-style = "double"
 [tool.ruff.lint]
 select = ["E", "W"]
 ignore = [
+    "E999",
     "EXE001",
     "UP009",
     "F401",
```
@@ -19,8 +19,8 @@ TEXT_LINE_START = ""
 COMMENT_LINE_START = "% "

 # Opening and closing delimiters
-OPENS = ["{", "(", "["]
-CLOSES = ["}", ")", "]"]
+OPENS = ['{', '(', '[']
+CLOSES = ['}', ')', ']']

 # Names of LaTeX verbatim environments
 VERBATIMS = ["verbatim", "Verbatim", "lstlisting", "minted", "comment"]
@@ -138,7 +138,7 @@ class Pattern:
             contains_env_end=ENV_END in s,
             contains_item=ITEM in s,
             contains_splitting=True,
-            contains_comment="%" in s,
+            contains_comment='%' in s,
         )
     else:
         return cls(
@@ -146,7 +146,7 @@ class Pattern:
             contains_env_end=False,
             contains_item=False,
             contains_splitting=False,
-            contains_comment="%" in s,
+            contains_comment='%' in s,
         )

@@ -169,11 +169,11 @@ def find_comment_index(line: str, pattern: Pattern) -> Optional[int]:

     in_command = False
     for i, c in enumerate(line):
-        if c == "\\":
+        if c == '\\':
             in_command = True
         elif in_command and not c.isalpha():
             in_command = False
-        elif c == "%" and not in_command:
+        elif c == '%' and not in_command:
             return i

     return None
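Quoting aside, the loop above finds the first `%` that actually starts a comment. A self-contained sketch of the same scan (the `Pattern` parameter is dropped here for brevity):

```python
from typing import Optional

def find_comment_index(line: str) -> Optional[int]:
    # Track whether we are inside a backslash command; a '%' that
    # immediately follows a command name, or is escaped as '\%',
    # does not start a comment.
    in_command = False
    for i, c in enumerate(line):
        if c == '\\':
            in_command = True
        elif in_command and not c.isalpha():
            in_command = False
        elif c == '%' and not in_command:
            return i
    return None
```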
@@ -390,10 +390,10 @@ def find_wrap_point(line: str, indent_length: int, args: Args) -> Optional[int]:
         line_width += 1
         if line_width > wrap_boundary and wrap_point is not None:
             break
-        if c == " " and prev_char != "\\":
+        if c == ' ' and prev_char != '\\':
             if after_char:
                 wrap_point = i
-        elif c != "%":
+        elif c != '%':
             after_char = True
         prev_char = c

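The wrap-point scan can be exercised standalone. A sketch that keeps the loop body verbatim but replaces the `indent_length`/`args` machinery with an explicit `wrap_boundary` (that simplification is mine, not the diff's):

```python
from typing import Optional

def find_wrap_point(line: str, wrap_boundary: int) -> Optional[int]:
    # Remember the last space that is a legal break point; a space
    # escaped as '\ ' is skipped, and a space only becomes a break
    # point once a non-'%' character has been seen before it.
    wrap_point: Optional[int] = None
    after_char = False
    prev_char = ''
    line_width = 0
    for i, c in enumerate(line):
        line_width += 1
        if line_width > wrap_boundary and wrap_point is not None:
            break
        if c == ' ' and prev_char != '\\':
            if after_char:
                wrap_point = i
        elif c != '%':
            after_char = True
        prev_char = c
    return wrap_point
```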
@@ -483,8 +483,8 @@ def split_line(line: str, state: State, file: str, args: Args, logs: List[Log])
     if not match:
         return line, ""

-    prev = match.group("prev")
-    rest = match.group("env")
+    prev = match.group('prev')
+    rest = match.group('env')

     if args.verbosity >= 3:  # Trace level
         logs.append(
@@ -517,8 +517,8 @@ def clean_text(text: str, args: Args) -> str:
     text = RE_NEWLINES.sub(f"{LINE_END}{LINE_END}", text)

     # Remove tabs if they shouldn't be used
-    if args.tabchar != "\t":
-        text = text.replace("\t", " " * args.tabsize)
+    if args.tabchar != '\t':
+        text = text.replace('\t', ' ' * args.tabsize)

     # Remove trailing spaces
     text = RE_TRAIL.sub(LINE_END, text)
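The tab-handling branch is easy to check in isolation. A small sketch with explicit `tabchar`/`tabsize` arguments standing in for the `Args` object (names assumed from the diff):

```python
def expand_tabs(text: str, tabchar: str, tabsize: int) -> str:
    # Mirror of the clean_text step: unless tabs are the configured
    # indentation character, replace each tab with `tabsize` spaces.
    if tabchar != '\t':
        text = text.replace('\t', ' ' * tabsize)
    return text
```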
@@ -577,7 +577,7 @@ def _format_latex(old_text: str, file: str, args: Args) -> Tuple[str, List[Log]]
     new_text = ""

     # Select the character used for indentation
-    indent_char = "\t" if args.tabchar == "\t" else " "
+    indent_char = '\t' if args.tabchar == '\t' else ' '

     # Get any extra environments to be indented as lists
     lists_begin = [f"\\begin{{{l}}}" for l in args.lists]
@@ -5,13 +5,13 @@ from .format import format_latex


 def _rm_dollar_surr(content):
-    pattern = re.compile(r"\\[a-zA-Z]+\$.*?\$|\$.*?\$")
+    pattern = re.compile(r'\\[a-zA-Z]+\$.*?\$|\$.*?\$')
     matches = pattern.findall(content)

     for match in matches:
-        if not re.match(r"\\[a-zA-Z]+", match):
-            new_match = match.strip("$")
-            content = content.replace(match, " " + new_match + " ")
+        if not re.match(r'\\[a-zA-Z]+', match):
+            new_match = match.strip('$')
+            content = content.replace(match, ' ' + new_match + ' ')

     return content
@@ -33,97 +33,97 @@ def to_katex(formula: str) -> str:
     """
     res = formula
     # remove mbox surrounding
-    res = change_all(res, r"\mbox ", r" ", r"{", r"}", r"", r"")
-    res = change_all(res, r"\mbox", r" ", r"{", r"}", r"", r"")
+    res = change_all(res, r'\mbox ', r' ', r'{', r'}', r'', r'')
+    res = change_all(res, r'\mbox', r' ', r'{', r'}', r'', r'')
     # remove hbox surrounding
-    res = re.sub(r"\\hbox to ?-? ?\d+\.\d+(pt)?\{", r"\\hbox{", res)
-    res = change_all(res, r"\hbox", r" ", r"{", r"}", r"", r" ")
+    res = re.sub(r'\\hbox to ?-? ?\d+\.\d+(pt)?\{', r'\\hbox{', res)
+    res = change_all(res, r'\hbox', r' ', r'{', r'}', r'', r' ')
     # remove raise surrounding
-    res = re.sub(r"\\raise ?-? ?\d+\.\d+(pt)?", r" ", res)
+    res = re.sub(r'\\raise ?-? ?\d+\.\d+(pt)?', r' ', res)
     # remove makebox
-    res = re.sub(r"\\makebox ?\[\d+\.\d+(pt)?\]\{", r"\\makebox{", res)
-    res = change_all(res, r"\makebox", r" ", r"{", r"}", r"", r" ")
+    res = re.sub(r'\\makebox ?\[\d+\.\d+(pt)?\]\{', r'\\makebox{', res)
+    res = change_all(res, r'\makebox', r' ', r'{', r'}', r'', r' ')
     # remove vbox surrounding, scalebox surrounding
-    res = re.sub(r"\\raisebox\{-? ?\d+\.\d+(pt)?\}\{", r"\\raisebox{", res)
-    res = re.sub(r"\\scalebox\{-? ?\d+\.\d+(pt)?\}\{", r"\\scalebox{", res)
-    res = change_all(res, r"\scalebox", r" ", r"{", r"}", r"", r" ")
-    res = change_all(res, r"\raisebox", r" ", r"{", r"}", r"", r" ")
-    res = change_all(res, r"\vbox", r" ", r"{", r"}", r"", r" ")
+    res = re.sub(r'\\raisebox\{-? ?\d+\.\d+(pt)?\}\{', r'\\raisebox{', res)
+    res = re.sub(r'\\scalebox\{-? ?\d+\.\d+(pt)?\}\{', r'\\scalebox{', res)
+    res = change_all(res, r'\scalebox', r' ', r'{', r'}', r'', r' ')
+    res = change_all(res, r'\raisebox', r' ', r'{', r'}', r'', r' ')
+    res = change_all(res, r'\vbox', r' ', r'{', r'}', r'', r' ')

     origin_instructions = [
-        r"\Huge",
-        r"\huge",
-        r"\LARGE",
-        r"\Large",
-        r"\large",
-        r"\normalsize",
-        r"\small",
-        r"\footnotesize",
-        r"\tiny",
+        r'\Huge',
+        r'\huge',
+        r'\LARGE',
+        r'\Large',
+        r'\large',
+        r'\normalsize',
+        r'\small',
+        r'\footnotesize',
+        r'\tiny',
     ]
     for old_ins, new_ins in zip(origin_instructions, origin_instructions):
-        res = change_all(res, old_ins, new_ins, r"$", r"$", "{", "}")
-    res = change_all(res, r"\mathbf", r"\bm", r"{", r"}", r"{", r"}")
-    res = change_all(res, r"\boldmath ", r"\bm", r"{", r"}", r"{", r"}")
-    res = change_all(res, r"\boldmath", r"\bm", r"{", r"}", r"{", r"}")
-    res = change_all(res, r"\boldmath ", r"\bm", r"$", r"$", r"{", r"}")
-    res = change_all(res, r"\boldmath", r"\bm", r"$", r"$", r"{", r"}")
-    res = change_all(res, r"\scriptsize", r"\scriptsize", r"$", r"$", r"{", r"}")
-    res = change_all(res, r"\emph", r"\textit", r"{", r"}", r"{", r"}")
-    res = change_all(res, r"\emph ", r"\textit", r"{", r"}", r"{", r"}")
+        res = change_all(res, old_ins, new_ins, r'$', r'$', '{', '}')
+    res = change_all(res, r'\mathbf', r'\bm', r'{', r'}', r'{', r'}')
+    res = change_all(res, r'\boldmath ', r'\bm', r'{', r'}', r'{', r'}')
+    res = change_all(res, r'\boldmath', r'\bm', r'{', r'}', r'{', r'}')
+    res = change_all(res, r'\boldmath ', r'\bm', r'$', r'$', r'{', r'}')
+    res = change_all(res, r'\boldmath', r'\bm', r'$', r'$', r'{', r'}')
+    res = change_all(res, r'\scriptsize', r'\scriptsize', r'$', r'$', r'{', r'}')
+    res = change_all(res, r'\emph', r'\textit', r'{', r'}', r'{', r'}')
+    res = change_all(res, r'\emph ', r'\textit', r'{', r'}', r'{', r'}')

     # remove bold command
-    res = change_all(res, r"\bm", r" ", r"{", r"}", r"", r"")
+    res = change_all(res, r'\bm', r' ', r'{', r'}', r'', r'')

     origin_instructions = [
-        r"\left",
-        r"\middle",
-        r"\right",
-        r"\big",
-        r"\Big",
-        r"\bigg",
-        r"\Bigg",
-        r"\bigl",
-        r"\Bigl",
-        r"\biggl",
-        r"\Biggl",
-        r"\bigm",
-        r"\Bigm",
-        r"\biggm",
-        r"\Biggm",
-        r"\bigr",
-        r"\Bigr",
-        r"\biggr",
-        r"\Biggr",
+        r'\left',
+        r'\middle',
+        r'\right',
+        r'\big',
+        r'\Big',
+        r'\bigg',
+        r'\Bigg',
+        r'\bigl',
+        r'\Bigl',
+        r'\biggl',
+        r'\Biggl',
+        r'\bigm',
+        r'\Bigm',
+        r'\biggm',
+        r'\Biggm',
+        r'\bigr',
+        r'\Bigr',
+        r'\biggr',
+        r'\Biggr',
     ]
     for origin_ins in origin_instructions:
-        res = change_all(res, origin_ins, origin_ins, r"{", r"}", r"", r"")
+        res = change_all(res, origin_ins, origin_ins, r'{', r'}', r'', r'')

-    res = re.sub(r"\\\[(.*?)\\\]", r"\1\\newline", res)
+    res = re.sub(r'\\\[(.*?)\\\]', r'\1\\newline', res)

-    if res.endswith(r"\newline"):
+    if res.endswith(r'\newline'):
         res = res[:-8]

     # remove multiple spaces
-    res = re.sub(r"(\\,){1,}", " ", res)
-    res = re.sub(r"(\\!){1,}", " ", res)
-    res = re.sub(r"(\\;){1,}", " ", res)
-    res = re.sub(r"(\\:){1,}", " ", res)
-    res = re.sub(r"\\vspace\{.*?}", "", res)
+    res = re.sub(r'(\\,){1,}', ' ', res)
+    res = re.sub(r'(\\!){1,}', ' ', res)
+    res = re.sub(r'(\\;){1,}', ' ', res)
+    res = re.sub(r'(\\:){1,}', ' ', res)
+    res = re.sub(r'\\vspace\{.*?}', '', res)

     # merge consecutive text
     def merge_texts(match):
         texts = match.group(0)
-        merged_content = "".join(re.findall(r"\\text\{([^}]*)\}", texts))
-        return f"\\text{{{merged_content}}}"
+        merged_content = ''.join(re.findall(r'\\text\{([^}]*)\}', texts))
+        return f'\\text{{{merged_content}}}'

-    res = re.sub(r"(\\text\{[^}]*\}\s*){2,}", merge_texts, res)
+    res = re.sub(r'(\\text\{[^}]*\}\s*){2,}', merge_texts, res)

-    res = res.replace(r"\bf ", "")
+    res = res.replace(r'\bf ', '')
     res = _rm_dollar_surr(res)

     # remove extra spaces (keeping only one)
-    res = re.sub(r" +", " ", res)
+    res = re.sub(r' +', ' ', res)

     # format latex
     res = res.strip()
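One step of `to_katex` worth illustrating is the `\text{...}` merge: runs of consecutive `\text` groups are collapsed into a single group. A standalone reproduction of that substitution:

```python
import re

def merge_texts(match):
    # Concatenate the contents of every \text{...} group in the matched run.
    texts = match.group(0)
    merged_content = ''.join(re.findall(r'\\text\{([^}]*)\}', texts))
    return f'\\text{{{merged_content}}}'

# Two or more adjacent \text groups (optionally space-separated) merge into one.
merged = re.sub(r'(\\text\{[^}]*\}\s*){2,}', merge_texts, r'\text{Hello} \text{world}')
```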
@@ -1,3 +1,3 @@
 from .texteller import TexTeller

-__all__ = ["TexTeller"]
+__all__ = ['TexTeller']
@@ -41,7 +41,7 @@ def readimgs(image_paths: list[str]) -> list[np.ndarray]:
         if image is None:
             raise ValueError(f"Image at {path} could not be read.")
         if image.dtype == np.uint16:
-            _logger.warning(f"Converting {path} to 8-bit, image may be lossy.")
+            _logger.warning(f'Converting {path} to 8-bit, image may be lossy.')
             image = cv2.convertScaleAbs(image, alpha=(255.0 / 65535.0))

         channels = 1 if len(image.shape) == 2 else image.shape[2]
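For reference, the `cv2.convertScaleAbs(image, alpha=255.0 / 65535.0)` call above maps the 16-bit value range onto 8 bits. A rough NumPy equivalent of the per-pixel arithmetic (an approximation for illustration, not the OpenCV implementation):

```python
import numpy as np

def to_uint8(image: np.ndarray) -> np.ndarray:
    # Approximates cv2.convertScaleAbs: scale by alpha, take the
    # absolute value, round, then saturate into the uint8 range.
    scaled = np.abs(image.astype(np.float64) * (255.0 / 65535.0))
    return np.clip(np.rint(scaled), 0, 255).astype(np.uint8)
```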
@@ -112,7 +112,7 @@ def transform(images: List[Union[np.ndarray, Image.Image]]) -> List[torch.Tensor

     assert IMG_CHANNELS == 1, "Only support grayscale images for now"
     images = [
-        np.array(img.convert("RGB")) if isinstance(img, Image.Image) else img for img in images
+        np.array(img.convert('RGB')) if isinstance(img, Image.Image) else img for img in images
     ]
     images = [trim_white_border(image) for image in images]
     images = [general_transform_pipeline(image) for image in images]
@@ -21,7 +21,7 @@ def _change(input_str, old_inst, new_inst, old_surr_l, old_surr_r, new_surr_l, n
     j = start + 1
     escaped = False
     while j < n and count > 0:
-        if input_str[j] == "\\" and not escaped:
+        if input_str[j] == '\\' and not escaped:
             escaped = True
             j += 1
             continue
@@ -71,10 +71,10 @@ def change_all(input_str, old_inst, new_inst, old_surr_l, old_surr_r, new_surr_l
     for p in pos[::-1]:
         res[p:] = list(
             _change(
-                "".join(res[p:]), old_inst, new_inst, old_surr_l, old_surr_r, new_surr_l, new_surr_r
+                ''.join(res[p:]), old_inst, new_inst, old_surr_l, old_surr_r, new_surr_l, new_surr_r
            )
        )
-    res = "".join(res)
+    res = ''.join(res)
     return res
@@ -121,7 +121,7 @@ def add_newlines(latex_str: str) -> str:

     # 4. Cleanup: Collapse multiple consecutive newlines into a single newline.
     # This handles cases where the replacements above might have created \n\n.
-    processed_str = re.sub(r"\n{2,}", "\n", processed_str)
+    processed_str = re.sub(r'\n{2,}', '\n', processed_str)

     # Remove leading/trailing whitespace (including potential single newlines
     # at the very start/end resulting from the replacements) from the entire result.
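The cleanup substitution in `add_newlines` is easy to verify in isolation; it collapses any run of blank lines left behind by the earlier replacements:

```python
import re

def collapse_newlines(s: str) -> str:
    # Step 4 of add_newlines: any run of two or more '\n'
    # becomes a single '\n'.
    return re.sub(r'\n{2,}', '\n', s)
```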