Version: 1.0.0 | Last Updated: 2024

نخستین آشنایی

Introduction | آشنایی 

English

Pasban is a text processing library for pure Persian (پارسی سره) that detects foreign (non-Persian) words. It offers both Aho-Corasick and regex-based detection engines, text normalization, and contextual cleaning, and is designed for accuracy, speed, and extensibility.

Key Features:

Fast multi-pattern matching with Aho-Corasick algorithm
Regex-based detection for maximum accuracy
Advanced Persian text normalization
Contextual cleaning and boundary detection
Comprehensive statistics and reporting
Extensible word database
Fully offline after initial setup

Links:

GitHub: pasban-py
Website: pasban

پارسی

پاسبان کتابخانه‌ای برای پردازش متن پارسی سره است که شناسایی واژگان بیگانه را با موشکافی و سرعت بالا انجام می‌دهد. این کتابخانه موتورهای جستجوی چندالگویی (آهو-کُراسیک) و برپایهٔ عبارت منظم، نرمال‌سازی و پالایش زمینه‌ای را فراهم می‌کند.

ویژگی‌های کلیدی:

جستجوی چندالگویی سریع با الگوریتم آهو-کُراسیک
شناسایی برپایهٔ عبارت منظم برای موشکافی بیشینه
نرمال‌سازی پیشرفته متن پارسی
پالایش زمینه‌ای و تشخیص مرزهای واژه
آمار و گزارش‌دهی جامع
پایگاه واژگان قابل گسترش
کاملاً آفلاین پس از راه‌اندازی نخستین

لینک‌ها:

گیت‌هاب: pasban-py
تارنما: pasban

Installation | نصب 

English

pip install pasban

Requirements:

Python 3.8+
Internet connection (only for initial database download)

پارسی

برای نصب پاسبان کافی است دستور زیر را اجرا کنید:

pip install pasban

پیش‌نیازها:

پایتون ۳.۸ یا بالاتر
پیوند به اینترنت (تنها برای بارگیری نخستین پایگاه‌داده)

Quick Start Example | نمونه شروع سریع

from pasban.detector import WordDetector

# Initialize detector
detector = WordDetector()

# Detect foreign words
text = "من با کامپیوتر کار می‌کنم و از اینترنت استفاده می‌کنم."
result = detector.detect(text)

# Print results
print(f"Foreign words: {result.foreign_words}")
print(f"Percentage: {result.foreign_percentage:.1f}%")
print(f"\nReport:\n{result.to_summary_text}")

Modules | ساختار ماژول‌ها 

English

detector: Word detection engines (WordDetector, WordDetectorRegex)
db: Word database access (WordRepo, DataLoader)
normalizer: Text normalization and contextual cleaning
core: Core data types (DetectData, AhoCorasickAutomaton)

پارسی

detector: موتورهای شناسایی واژه (WordDetector و WordDetectorRegex)
db: پایگاه واژگان (WordRepo و DataLoader)
normalizer: نرمال‌سازی و پالایش متن
core: انواع دادهٔ پایه مانند DetectData و AhoCorasickAutomaton

WordDetector | WordDetectorRegex

English

WordDetector uses an Aho-Corasick automaton for fast multi-pattern matching. It normalizes and contextually cleans the text before matching so that word boundaries are detected accurately. WordDetectorRegex uses a compiled regex pattern instead, which can be more accurate in some cases at the cost of speed.

Constructor

from pasban.detector import WordDetector, WordDetectorRegex

detector = WordDetector()
detector_regex = WordDetectorRegex()

Methods

detect(text: str, normalize: bool = True, contextual: bool = True) → DetectData

Detect foreign words and return a DetectData object with full statistics.

Parameters:

text -- Input text to analyze
normalize -- Apply text normalization (default: True)
contextual -- Apply contextual cleaning (default: True)

Returns:

DetectData object with detection results and statistics

detect_words(text: str, normalize: bool = True, contextual: bool = True) → dict[str, str]

Return only detected words and their Persian equivalents.

Parameters:

text -- Input text to analyze
normalize -- Apply text normalization (default: True)
contextual -- Apply contextual cleaning (default: True)

Returns:

Dictionary mapping foreign words to Persian equivalents

find_words_in_text(text: str) → list[str]

Find all foreign words in text (WordDetectorRegex only).

Parameters: text -- Input text to analyze
Returns: List of detected foreign words

reload() → None: Reload words from the database.

Example Usage

from pasban.detector import WordDetector

# Initialize detector (one-time setup)
detector = WordDetector()

# Detect foreign words with full statistics
text = "این متن شامل کامپیوتر و اینترنت است."
result = detector.detect(text)

# Access detection results
print(f"Foreign words: {result.foreign_words}")
# Output: ['کامپیوتر', 'اینترنت']

print(f"Mappings: {result.words}")
# Output: {'کامپیوتر': 'رایانه', 'اینترنت': 'اینترنت'}

print(f"Total occurrences: {result.count}")
# Output: 2

print(f"Unique words: {result.unique_count}")
# Output: 2

print(f"Foreign percentage: {result.foreign_percentage:.2f}%")
# Output: 28.57%

# Get Persian report
print(result.to_text)
# Output: گزارش کامل به پارسی

# Get summary report
print(result.to_summary_text)
# Output: خلاصه آماری

# Or just get the words dictionary
words = detector.detect_words(text)
print(words)
# Output: {'کامپیوتر': 'رایانه', 'اینترنت': 'اینترنت'}

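Advanced Usage

from pasban.detector import WordDetector, WordDetectorRegex

detector = WordDetector()
text = "این متن شامل کامپیوتر و اینترنت است."

# Detect without normalization
result = detector.detect(text, normalize=False)

# Detect without contextual cleaning
result = detector.detect(text, contextual=False)

# Detect with both disabled (fastest, least accurate)
result = detector.detect(text, normalize=False, contextual=False)

# Compare both engines
detector_ac = WordDetector()
detector_re = WordDetectorRegex()

result_ac = detector_ac.detect(text)
result_re = detector_re.detect(text)

print(f"Aho-Corasick found: {result_ac.count} words")
print(f"Regex found: {result_re.count} words")

# Reload the word list after database updates
detector.reload()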

پارسی

WordDetector با بهره‌گیری از آهو-کُراسیک، شناسایی واژگان بیگانه را با سرعت و موشکافی بالا انجام می‌دهد. WordDetectorRegex با عبارت منظم گاهی موشکافی بیشتری دارد.

سازنده

from pasban.detector import WordDetector, WordDetectorRegex

detector = WordDetector()
detector_regex = WordDetectorRegex()

متدها

detect(text: str, normalize: bool = True, contextual: bool = True) → DetectData

شناسایی واژگان بیگانه و بازگرداندن شیء DetectData با آمار کامل.

پارامترها:

text -- متن ورودی برای تحلیل
normalize -- اعمال نرمال‌سازی متن (پیش‌فرض: True)
contextual -- اعمال پالایش زمینه‌ای (پیش‌فرض: True)

بازگشت ها:

شیء DetectData با نتایج و آمار شناسایی

detect_words(text: str, normalize: bool = True, contextual: bool = True) → dict[str, str]

بازگرداندن فقط واژگان بیگانه و برابر پارسی آن‌ها.

پارامترها:

text -- متن ورودی برای تحلیل
normalize -- اعمال نرمال‌سازی متن (پیش‌فرض: True)
contextual -- اعمال پالایش زمینه‌ای (پیش‌فرض: True)

بازگشت ها:

دیکشنری نگاشت واژگان بیگانه به برابر پارسی

find_words_in_text(text: str) → list[str]

یافتن همهٔ واژگان بیگانه در متن (فقط WordDetectorRegex).

پارامترها:: text -- متن ورودی برای تحلیل
بازگشت ها:: فهرست واژگان بیگانه شناسایی‌شده

reload() → None: بارگذاری دوباره واژگان از پایگاه داده.

نمونه کاربرد

from pasban.detector import WordDetector

# راه‌اندازی شناساگر (راه‌اندازی یک‌بار)
detector = WordDetector()

# شناسایی واژگان بیگانه با آمار کامل
text = "این متن شامل کامپیوتر و اینترنت است."
result = detector.detect(text)

# دسترسی به نتایج شناسایی
print(f"واژگان بیگانه: {result.foreign_words}")
# خروجی: ['کامپیوتر', 'اینترنت']

print(f"نگاشت‌ها: {result.words}")
# خروجی: {'کامپیوتر': 'رایانه', 'اینترنت': 'اینترنت'}

print(f"تعداد کل رخدادها: {result.count}")
# خروجی: 2

print(f"واژگان یکتا: {result.unique_count}")
# خروجی: 2

print(f"درصد واژگان بیگانه: {result.foreign_percentage:.2f}%")
# خروجی: 28.57%

# دریافت گزارش پارسی
print(result.to_text)
# خروجی: گزارش کامل به پارسی

# دریافت خلاصه آماری
print(result.to_summary_text)
# خروجی: خلاصه آماری

# یا فقط دیکشنری واژگان را دریافت کنید
words = detector.detect_words(text)
print(words)
# خروجی: {'کامپیوتر': 'رایانه', 'اینترنت': 'اینترنت'}

کاربرد پیشرفته

from pasban.detector import WordDetector, WordDetectorRegex

detector = WordDetector()
text = "این متن شامل کامپیوتر و اینترنت است."

# شناسایی بدون نرمال‌سازی
result = detector.detect(text, normalize=False)

# شناسایی بدون پالایش زمینه‌ای
result = detector.detect(text, contextual=False)

# شناسایی با غیرفعال کردن هر دو (سریع‌ترین، کم‌موشکافی‌ترین)
result = detector.detect(text, normalize=False, contextual=False)

# مقایسه هر دو موتور
detector_ac = WordDetector()
detector_re = WordDetectorRegex()

result_ac = detector_ac.detect(text)
result_re = detector_re.detect(text)

print(f"آهو-کُراسیک یافت: {result_ac.count} واژه")
print(f"عبارت منظم یافت: {result_re.count} واژه")

# بارگذاری دوباره پایگاه‌داده پس از به‌روزرسانی
detector.reload()

Which Detector Should I Use? | کدام موتور را برگزینم؟

English

Pasban provides two main detection engines:

WordDetector (Aho-Corasick): Extremely fast for large texts and wordlists; slightly less accurate in rare edge cases; recommended for most applications.
WordDetectorRegex: More accurate (especially with complex boundaries) but slower; best for small datasets or when maximum precision is needed.

WordDetector

✅ Recommended for most use cases

Extremely fast (roughly 17-30x faster detection in the benchmarks below)
Excellent for large-scale processing
Real-time applications
Batch processing
Memory efficient

WordDetectorRegex

🎯 For maximum accuracy

Higher precision
Better boundary detection
Small text processing
Critical accuracy scenarios
Educational/research use
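To make "better boundary detection" concrete, here is a minimal sketch of how a regex engine of this kind could be assembled. It is an illustration under stated assumptions, not Pasban's actual pattern: the helper name compile_word_pattern and the three-word list are hypothetical, and the boundary rule (no word character directly before or after a match) is one plausible choice.

```python
import re


def compile_word_pattern(words):
    """Build one alternation pattern from a word list (hypothetical helper)."""
    # Longest-first so longer entries win over their own prefixes.
    ordered = sorted(words, key=len, reverse=True)
    alternation = "|".join(re.escape(w) for w in ordered)
    # Lookarounds reject matches embedded in a longer run of word characters.
    return re.compile(rf"(?<!\w)(?:{alternation})(?!\w)")


words = ["کامپیوتر", "اینترنت", "فایل"]
pattern = compile_word_pattern(words)

text = "این متن شامل کامپیوتر و اینترنت است."
print(pattern.findall(text))

# A listed word glued to further letters is not reported as a match.
print(pattern.findall(words[2] + "ها"))
```

The lookarounds are what a plain substring search lacks: without them, a short entry would also match inside longer, unrelated words.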

پارسی

پاسبان دو موتور اصلی دارد:

WordDetector (آهو-کُراسیک): برای متن‌های بزرگ بسیار سریع؛ در موارد نادر کمی کم‌موشکافی‌تر؛ برای اکثر کاربردها پیشنهاد می‌شود.
WordDetectorRegex: دقیق‌تر (به‌ویژه در مرزهای پیچیده) اما کندتر؛ برای داده‌های کوچک یا نیاز به موشکافی بیشینه مناسب است.

WordDetector

✅ پیشنهاد برای اکثر کاربردها

بسیار سریع (در سنجش‌های زیر حدود ۱۷ تا ۳۰ برابر سریع‌تر در شناسایی)
عالی برای پردازش در مقیاس بزرگ
کاربردهای بلادرنگ
پردازش دسته‌ای
کارآمد در حافظه

WordDetectorRegex

🎯 برای موشکافی بیشینه

موشکافی بالاتر
تشخیص بهتر مرزها
پردازش متن‌های کوچک
سناریوهای حساس به موشکافی
کاربرد آموزشی/پژوهشی

Performance Benchmark | سنجش کارایی

English

Comprehensive benchmark results on Intel Core i7-8565U (100 iterations):

BigText (1216 chars, 46 foreign words)

Engine	Operation	Average	Min/Max	StdDev
WordDetector	Init	0.086s	0.081s / 0.099s	0.003s
WordDetector	Detect	0.000650s	0.000562s / 0.000975s	0.000084s
WordDetectorRegex	Init	0.039s	0.035s / 0.186s	0.021s
WordDetectorRegex	Detect	0.012093s	0.011914s / 0.013110s	0.000191s

Important

WordDetector detection is ~18.6x faster on this text (0.000650s vs 0.012093s), although its initialization is slower.

SmallText (86 chars, 3 foreign words)

Engine	Operation	Average	Min/Max	StdDev
WordDetector	Init	0.084s	0.079s / 0.090s	0.002s
WordDetector	Detect	0.000054s	0.000050s / 0.000100s	0.000007s
WordDetectorRegex	Init	0.036s	0.034s / 0.040s	0.001s
WordDetectorRegex	Detect	0.000917s	0.000888s / 0.001149s	0.000040s

Important

WordDetector detection is ~17x faster on this text (0.000054s vs 0.000917s).

PurePersian (94 chars, 0 foreign words)

Engine	Operation	Average	Min/Max	StdDev
WordDetector	Init	0.089s	0.081s / 0.122s	0.009s
WordDetector	Detect	0.000038s	0.000037s / 0.000070s	0.000005s
WordDetectorRegex	Init	0.037s	0.036s / 0.046s	0.002s
WordDetectorRegex	Detect	0.001151s	0.001115s / 0.001396s	0.000050s

Important

WordDetector detection is ~30x faster on this text (0.000038s vs 0.001151s).

Performance Summary

WordDetector initialization: ~85ms (one-time cost)
WordDetector detection: 0.04-0.65ms per text
WordDetectorRegex initialization: ~37ms (one-time cost)
WordDetectorRegex detection: 0.9-12ms per text
Speed advantage: WordDetector is 17-30x faster for detection
Accuracy difference: < 2% in most cases
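The speed-up figures follow directly from the average detection times in the tables above. The short calculation below reproduces them and also estimates, as an illustration, how many BigText-sized calls it takes for WordDetector's slower initialization to pay for itself. Only the averages come from the benchmark; the break-even framing is an added back-of-the-envelope estimate.

```python
# Average detect() times in seconds, copied from the benchmark tables:
# (WordDetector, WordDetectorRegex)
benchmarks = {
    "BigText": (0.000650, 0.012093),
    "SmallText": (0.000054, 0.000917),
    "PurePersian": (0.000038, 0.001151),
}

speedups = {name: regex / aho for name, (aho, regex) in benchmarks.items()}
for name, ratio in speedups.items():
    print(f"{name}: WordDetector is {ratio:.1f}x faster")
# BigText: WordDetector is 18.6x faster
# SmallText: WordDetector is 17.0x faster
# PurePersian: WordDetector is 30.3x faster

# Break-even on BigText: extra init cost divided by time saved per call.
extra_init = 0.086 - 0.039
saved_per_call = 0.012093 - 0.000650
break_even_calls = extra_init / saved_per_call
print(f"Init cost is recovered after about {break_even_calls:.1f} BigText-sized calls")
```

For a process that analyzes more than a handful of texts, the one-time initialization difference is therefore negligible next to the per-call savings.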

پارسی

نتایج سنجش جامع بر روی Intel Core i7-8565U (۱۰۰ تکرار):

متن بزرگ (۱۲۱۶ نویسه، ۴۶ واژه بیگانه)

موتور	عملیات	میانگین	کمینه/بیشینه	انحراف معیار
WordDetector	راه‌اندازی	0.086s	0.081s / 0.099s	0.003s
	شناسایی	0.000650s	0.000562s / 0.000975s	0.000084s
WordDetectorRegex	راه‌اندازی	0.039s	0.035s / 0.186s	0.021s
	شناسایی	0.012093s	0.011914s / 0.013110s	0.000191s

مهم

WordDetector در متن‌های بزرگ ~۱۸.۶ برابر سریع‌تر است!

متن کوچک (۸۶ نویسه، ۳ واژه بیگانه)

موتور	عملیات	میانگین	کمینه/بیشینه	انحراف معیار
WordDetector	راه‌اندازی	0.084s	0.079s / 0.090s	0.002s
	شناسایی	0.000054s	0.000050s / 0.000100s	0.000007s
WordDetectorRegex	راه‌اندازی	0.036s	0.034s / 0.040s	0.001s
	شناسایی	0.000917s	0.000888s / 0.001149s	0.000040s

مهم

WordDetector در متن‌های کوچک ~۱۷ برابر سریع‌تر است!

متن پارسی سره (۹۴ نویسه، ۰ واژه بیگانه)

موتور	عملیات	میانگین	کمینه/بیشینه	انحراف معیار
WordDetector	راه‌اندازی	0.089s	0.081s / 0.122s	0.009s
	شناسایی	0.000038s	0.000037s / 0.000070s	0.000005s
WordDetectorRegex	راه‌اندازی	0.037s	0.036s / 0.046s	0.002s
	شناسایی	0.001151s	0.001115s / 0.001396s	0.000050s

مهم

WordDetector در متن پارسی سره ~۳۰ برابر سریع‌تر است!

خلاصه کارایی

WordDetector راه‌اندازی: ~۸۵ میلی‌ثانیه (هزینه یک‌بار)
WordDetector شناسایی: ۰.۰۴-۰.۶۵ میلی‌ثانیه به ازای هر متن
WordDetectorRegex راه‌اندازی: ~۳۷ میلی‌ثانیه (هزینه یک‌بار)
WordDetectorRegex شناسایی: ۰.۹-۱۲ میلی‌ثانیه به ازای هر متن
برتری سرعت: WordDetector در شناسایی ۱۷-۳۰ برابر سریع‌تر است
تفاوت موشکافی: در اکثر موارد کمتر از ۲٪

WordRepo | پایگاه واژگان

English

WordRepo manages the database of foreign words and their Persian equivalents. You can access, search, add, remove, or update words.

Constructor

from pasban.db import WordRepo

repo = WordRepo()

Methods

get_all_words(reload: bool = False) → dict[str, str]

Get all words and their Persian equivalents.

Parameters: reload -- Force reload from database (default: False)
Returns: Dictionary mapping foreign words to Persian equivalents

search_word(search_term: str, limit: int = 5) → list[tuple[str, str]]

Search for a word or its Persian equivalent.

Parameters:

search_term -- Term to search for
limit -- Maximum number of results (default: 5)

Returns:

List of tuples (foreign_word, persian_equivalent)

add_word(foreign: str, persian: str) → None

Add a new word to the database.

Parameters:

foreign -- Foreign (non-Persian) word
persian -- Persian equivalent

remove_word(foreign: str) → None

Remove a word from the database.

Parameters: foreign -- Foreign word to remove

update_word(foreign: str, persian: str) → None

Update a word's Persian equivalent.

Parameters:

foreign -- Foreign word to update
persian -- New Persian equivalent

get_persian(foreign: str) → str

Get the Persian equivalent of a foreign word.

Parameters: foreign -- Foreign word
Returns: Persian equivalent (or empty string if not found)
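The method contracts above can be summarized with a small dict-backed stand-in, which is also handy for unit tests that should not touch the real database. This is a hypothetical sketch: the class name InMemoryWordRepo is invented here, and the substring matching rule in search_word and the "ignore missing keys" behavior of update_word and remove_word are assumptions, since the text does not specify them.

```python
class InMemoryWordRepo:
    """Dict-backed illustration of the WordRepo method contracts (hypothetical)."""

    def __init__(self, words=None):
        self._words = dict(words or {})

    def get_all_words(self):
        return dict(self._words)

    def get_persian(self, foreign):
        # Documented contract: empty string when the word is not found.
        return self._words.get(foreign, "")

    def add_word(self, foreign, persian):
        self._words[foreign] = persian

    def update_word(self, foreign, persian):
        # Assumption: updating an unknown word is a no-op.
        if foreign in self._words:
            self._words[foreign] = persian

    def remove_word(self, foreign):
        # Assumption: removing an unknown word is a no-op.
        self._words.pop(foreign, None)

    def search_word(self, search_term, limit=5):
        # Assumed rule: substring match on either side of the pair.
        hits = [(f, p) for f, p in self._words.items()
                if search_term in f or search_term in p]
        return hits[:limit]
```

Because it mirrors the documented signatures, code written against WordRepo can be exercised with this class in tests, as long as it does not rely on behavior the sketch only assumes.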

Example Usage

from pasban.db import WordRepo

repo = WordRepo()

# Get all words
all_words = repo.get_all_words()
print(f"Total words: {len(all_words)}")
print(f"کامپیوتر -> {all_words.get('کامپیوتر')}")
# Output: کامپیوتر -> رایانه

# Search for a word
results = repo.search_word("کامپیوتر", limit=5)
for foreign, persian in results:
    print(f"{foreign} -> {persian}")
# Output: کامپیوتر -> رایانه

# Get Persian equivalent
persian = repo.get_persian("کامپیوتر")
print(persian)
# Output: رایانه

# Add a new word
repo.add_word("ایمیل", "رایانامه")
print(repo.get_persian("ایمیل"))
# Output: رایانامه

# Update a word
repo.update_word("ایمیل", "پست الکترونیک")
print(repo.get_persian("ایمیل"))
# Output: پست الکترونیک

# Remove a word
repo.remove_word("ایمیل")
print(repo.get_persian("ایمیل"))
# Output: (empty string)

# Search with Persian equivalent
results = repo.search_word("رایانه")
for foreign, persian in results:
    print(f"{foreign} <- {persian}")

پارسی

WordRepo پایگاه دادهٔ واژگان بیگانه و برابر پارسی آن‌ها را مدیریت می‌کند. می‌توانید به واژگان دسترسی داشته باشید، جستجو کنید، بیفزایید، بردارید یا ویرایش کنید.

سازنده

from pasban.db import WordRepo

repo = WordRepo()

متدها

get_all_words(reload: bool = False) → dict[str, str]

دریافت همهٔ واژگان و برابر پارسی آن‌ها.

پارامترها:: reload -- بارگذاری اجباری از پایگاه‌داده (پیش‌فرض: False)
بازگشت ها:: دیکشنری نگاشت واژگان بیگانه به برابر پارسی

search_word(search_term: str, limit: int = 5) → list[tuple[str, str]]

جستجوی واژه یا برابر پارسی آن.

پارامترها:

search_term -- عبارت جستجو
limit -- حداکثر تعداد نتایج (پیش‌فرض: 5)

بازگشت ها:

فهرست تاپل‌ها (واژه_بیگانه، برابر_پارسی)

add_word(foreign: str, persian: str) → None

افزودن واژهٔ تازه به پایگاه‌داده.

پارامترها:

foreign -- واژه بیگانه (غیرپارسی)
persian -- برابر پارسی

remove_word(foreign: str) → None

برداشتن واژه از پایگاه‌داده.

پارامترها:: foreign -- واژه بیگانه برای حذف

update_word(foreign: str, persian: str) → None

ویرایش برابر پارسی یک واژه.

پارامترها:

foreign -- واژه بیگانه برای ویرایش
persian -- برابر پارسی جدید

get_persian(foreign: str) → str

دریافت برابر پارسی یک واژه بیگانه.

پارامترها:: foreign -- واژه بیگانه
بازگشت ها:: برابر پارسی (یا رشته خالی در صورت نیافتن)

نمونه کاربرد

from pasban.db import WordRepo

repo = WordRepo()

# دریافت همه واژگان
all_words = repo.get_all_words()
print(f"تعداد کل واژگان: {len(all_words)}")
print(f"کامپیوتر -> {all_words.get('کامپیوتر')}")
# خروجی: کامپیوتر -> رایانه

# جستجوی واژه
results = repo.search_word("کامپیوتر", limit=5)
for foreign, persian in results:
    print(f"{foreign} -> {persian}")
# خروجی: کامپیوتر -> رایانه

# دریافت برابر پارسی
persian = repo.get_persian("کامپیوتر")
print(persian)
# خروجی: رایانه

# افزودن واژه جدید
repo.add_word("ایمیل", "رایانامه")
print(repo.get_persian("ایمیل"))
# خروجی: رایانامه

# ویرایش واژه
repo.update_word("ایمیل", "پست الکترونیک")
print(repo.get_persian("ایمیل"))
# خروجی: پست الکترونیک

# حذف واژه
repo.remove_word("ایمیل")
print(repo.get_persian("ایمیل"))
# خروجی: (رشته خالی)

# جستجو با برابر پارسی
results = repo.search_word("رایانه")
for foreign, persian in results:
    print(f"{foreign} <- {persian}")

کاربرد پیشرفته

from pasban.db import WordRepo

repo = WordRepo()

# بارگذاری دوباره پایگاه‌داده از دیسک
all_words = repo.get_all_words(reload=True)

# عملیات دسته‌ای
new_words = {
    "وبسایت": "وبگاه",
    "ایمیل": "رایانامه",
    "فایل": "پرونده"
}

for foreign, persian in new_words.items():
    repo.add_word(foreign, persian)

# جستجو با عبارت‌های مختلف
computer_words = repo.search_word("رایانه", limit=10)
print(f"{len(computer_words)} واژه مرتبط با رایانه یافت شد")

Advanced Usage (English)

from pasban.db import WordRepo

repo = WordRepo()

# Reload database from disk
all_words = repo.get_all_words(reload=True)

# Batch operations
new_words = {
    "وبسایت": "وبگاه",
    "ایمیل": "رایانامه",
    "فایل": "پرونده"
}

for foreign, persian in new_words.items():
    repo.add_word(foreign, persian)

# Search with different terms
computer_words = repo.search_word("رایانه", limit=10)
print(f"Found {len(computer_words)} computer-related words")

DataLoader | بروزرسانی پایگاه واژگان

English

DataLoader handles downloading and updating the Pasban word database from GitHub. It downloads the database on first use, fetches newer releases when update() is called, and manages the local storage path.

Constructor & Usage

from pasban.db.loader import DataLoader

# Initialize database (downloads if missing)
DataLoader.initialize()

# Get local database path
db_path = DataLoader.get_db_path()
print(db_path)

# Check and update if needed
DataLoader.update()

Methods

initialize() → None: Ensure the database exists locally. If not, downloads the latest release automatically.

get_db_path() → Path

Get the path to the local database file. Automatically triggers initialization if missing.

Returns: Path object pointing to pasban.db

update(force_update: bool = False) → None

Check for updates and download the latest database if needed.

Parameters: force_update -- If True, download the latest release even if the local version is up-to-date (default: False)

_get_lasted_tag() → int | None

Internal method to read the last downloaded release tag from TAG file.

Returns: Last stored tag as integer, or None if unavailable

_get_release_data() → dict

Fetch metadata of the latest release from GitHub API.

Returns: JSON dictionary with release information

_get_db_url(assets_url: str) → str

Get the direct download URL for pasban.db from release assets.

Parameters: assets_url -- GitHub API URL for release assets
Returns: Direct browser download URL
Raises: DatabaseNotFound -- If pasban.db is not found

_download_release(assets_url: str, tag: str) → None

Download the latest database release and update the TAG file.

Parameters:

assets_url -- GitHub API URL for release assets
tag -- Release tag string
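The update decision described by update() and _get_lasted_tag() can be sketched as two small functions. This is an assumption-laden illustration, not DataLoader's code: read_tag and needs_download are hypothetical names, and the sketch assumes the TAG file holds a single integer and that a larger integer means a newer release.

```python
from pathlib import Path
from typing import Optional


def read_tag(tag_file: Path) -> Optional[int]:
    """Return the stored release tag as an int, or None if missing or invalid."""
    try:
        return int(tag_file.read_text(encoding="utf-8").strip())
    except (FileNotFoundError, ValueError):
        return None


def needs_download(local_tag: Optional[int], latest_tag: int,
                   force_update: bool = False) -> bool:
    # Download when forced, when nothing is stored yet, or when a newer tag exists.
    return force_update or local_tag is None or latest_tag > local_tag


# Example: no TAG file yet, so the first call should trigger a download.
local = read_tag(Path("TAG.example-missing"))
print(needs_download(local, latest_tag=3))
```

Treating a missing or unreadable TAG file the same as "no version installed" keeps the first-run and corrupted-state paths identical.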

Example Usage

from pasban.db.loader import DataLoader

# Force update the database
DataLoader.update(force_update=True)

# Normal update (only if new version available)
DataLoader.update()

# Get database path for other usage
db_path = DataLoader.get_db_path()
print(f"Database is stored at: {db_path}")

پارسی

بارآور داده مدیریت دانلود و بروزرسانی پایگاه واژگان پاسبان را برعهده دارد. این کلاس اطمینان می‌دهد که همیشه تازه‌ترین نسخه پایگاه داده روی دستگاه شما موجود باشد و مسیر نگهداری محلی را مدیریت می‌کند.

سازنده و نمونه کاربرد

from pasban.db.loader import DataLoader

# اطمینان از وجود پایگاه داده (دانلود در صورت نبود)
DataLoader.initialize()

# دریافت مسیر پایگاه داده
db_path = DataLoader.get_db_path()
print(db_path)

# بررسی و بروزرسانی در صورت نیاز
DataLoader.update()

متدها

initialize() → None: اطمینان از موجود بودن پایگاه داده به‌صورت محلی. در صورت نبود، تازه‌ترین نسخه را خودکار بارگیری می‌کند.

get_db_path() → Path

مسیر فایل پایگاه داده محلی را بازمی‌گرداند. در صورت نبود پایگاه داده، دانلود اولیه اجرا می‌شود.

بازگشت ها:: شیء Path که به pasban.db اشاره دارد

update(force_update: bool = False) → None

بررسی بروزرسانی‌ها و بارگیری تازه‌ترین نسخه پایگاه داده در صورت نیاز.

پارامترها:: force_update -- اگر True باشد، همیشه تازه‌ترین نسخه بارگیری می‌شود حتی اگر نسخه محلی به‌روز باشد (پیش‌فرض: False)

_get_lasted_tag() → int | None

کردار درونی برای خواندن آخرین شماره نسخه دانلود شده از پرونده TAG.

بازگشت ها:: آخرین شماره نسخه به‌صورت عدد یا None در صورت نبود

_get_release_data() → dict

دریافت داده‌های نسخه تازه از سرویس GitHub.

بازگشت ها:: دیکشنری JSON شامل داده‌های نسخه

_get_db_url(assets_url: str) → str

دریافت نشانی مستقیم بارگیری pasban.db از داده‌های نسخه.

پارامترها:: assets_url -- نشانی API گیت‌هاب برای داده‌های نسخه
بازگشت ها:: نشانی دانلود مستقیم
برانگیختن:: DatabaseNotFound -- در صورت نبود پرونده pasban.db

_download_release(assets_url: str, tag: str) → None

بارگیری تازه‌ترین نسخه پایگاه داده و بروزرسانی پرونده TAG.

پارامترها:

assets_url -- نشانی API گیت‌هاب برای داده‌ها
tag -- شماره نسخه

نمونه کاربرد

from pasban.db.loader import DataLoader

# بروزرسانی اجباری پایگاه داده
DataLoader.update(force_update=True)

# بروزرسانی معمولی (تنها در صورت وجود نسخه تازه)
DataLoader.update()

# دریافت مسیر پایگاه داده برای کاربردهای دیگر
db_path = DataLoader.get_db_path()
print(f"پایگاه داده در مسیر: {db_path}")

Normalizer | نرمال‌ساز

English

Normalize Persian text and remove non-standard characters and punctuation. The normalizer converts Arabic characters to Persian equivalents and ensures consistent text representation.

Methods

WordNormalizer.normalize_text(text: str) → str

Normalize Persian text by converting Arabic characters to Persian equivalents and removing non-standard characters.

Parameters: text -- Input text to normalize
Returns: Normalized text

Normalizations applied:

Arabic ك (U+0643) → Persian ک (U+06A9)
Arabic ي (U+064A) → Persian ی (U+06CC)
Arabic ة (U+0629) → Persian ه (U+0647)
Zero-width characters removed
Diacritics removed
Multiple spaces collapsed to single space
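The listed steps map naturally onto a translation table plus a few regular expressions. The sketch below is an approximation under stated assumptions, not WordNormalizer's implementation: normalize_sketch is a hypothetical name, the diacritic range U+064B to U+0652 and the zero-width set are assumed, and punctuation removal is left out.

```python
import re

# Arabic code points mapped to their Persian counterparts (from the list above).
_CHAR_MAP = str.maketrans({
    "\u0643": "\u06a9",  # Arabic kaf -> Persian kaf
    "\u064a": "\u06cc",  # Arabic yeh -> Persian yeh
    "\u0629": "\u0647",  # teh marbuta -> heh
})
# Assumed ranges: harakat U+064B-U+0652 and common zero-width characters.
_DIACRITICS = re.compile("[\u064b-\u0652]")
_ZERO_WIDTH = re.compile("[\u200b-\u200d\ufeff]")
_SPACES = re.compile(r"\s+")


def normalize_sketch(text: str) -> str:
    """Apply the character mapping, strip marks, and collapse whitespace."""
    text = text.translate(_CHAR_MAP)
    text = _DIACRITICS.sub("", text)
    text = _ZERO_WIDTH.sub("", text)
    return _SPACES.sub(" ", text).strip()


print(normalize_sketch("كتاب    در    كتابخانه    است"))
```

Doing the character mapping first means the later steps only ever see one form of each letter, which is why normalized strings compare equal regardless of the keyboard layout that produced them.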

Example Usage

from pasban.normalizer.text_normalizer import WordNormalizer

# Basic normalization
text = "این متن شامل ك، ي و ة است!"
normalized = WordNormalizer.normalize_text(text)
print(normalized)
# Output: "این متن شامل ک ی و ه است"

# Normalize mixed text
mixed_text = "كتاب    در    كتابخانه    است"
normalized = WordNormalizer.normalize_text(mixed_text)
print(normalized)
# Output: "کتاب در کتابخانه است"

# Remove diacritics
text_with_diacritics = "مَثَلاً کِتابِ خوبی است"
normalized = WordNormalizer.normalize_text(text_with_diacritics)
print(normalized)
# Output: "مثلا کتاب خوبی است"

When to use normalization

Before processing any Persian text
When comparing Persian strings
Before storing text in databases
When preparing text for machine learning
Recommended to always use with detectors (enabled by default)

پارسی

نرمال‌سازی متن پارسی و زدودن نویسه‌های بیگانه و نشانه‌گذاری. نرمال‌ساز نویسه‌های تازی را به برابر پارسی تبدیل می‌کند و بازنمایی یکسان متن را تضمین می‌کند.

متدها

WordNormalizer.normalize_text(text: str) → str

نرمال‌سازی متن پارسی با تبدیل نویسه‌های تازی به برابر پارسی و حذف نویسه‌های غیراستاندارد.

پارامترها:: text -- متن ورودی برای نرمال‌سازی
بازگشت ها:: متن نرمال‌شده

نرمال‌سازی‌های اعمال‌شده:

ك تازی (U+0643) ← ک پارسی (U+06A9)
ي تازی (U+064A) ← ی پارسی (U+06CC)
ة تازی (U+0629) ← ه پارسی (U+0647)
حذف نویسه‌های پهنای صفر
حذف اعراب
فشرده‌سازی فاصله‌های چندگانه به یک فاصله

نمونه کاربرد

from pasban.normalizer.text_normalizer import WordNormalizer

# نرمال‌سازی ساده
text = "این متن شامل ك، ي و ة است!"
normalized = WordNormalizer.normalize_text(text)
print(normalized)
# خروجی: "این متن شامل ک ی و ه است"

# نرمال‌سازی متن مختلط
mixed_text = "كتاب    در    كتابخانه    است"
normalized = WordNormalizer.normalize_text(mixed_text)
print(normalized)
# خروجی: "کتاب در کتابخانه است"

# حذف اعراب
text_with_diacritics = "مَثَلاً کِتابِ خوبی است"
normalized = WordNormalizer.normalize_text(text_with_diacritics)
print(normalized)
# خروجی: "مثلا کتاب خوبی است"

چه زمانی از نرمال‌سازی استفاده کنیم

پیش از پردازش هرگونه متن پارسی
هنگام مقایسه رشته‌های پارسی
پیش از ذخیره متن در پایگاه‌داده
هنگام آماده‌سازی متن برای یادگیری ماشین
توصیه می‌شود همواره با شناساگرها استفاده شود (پیش‌فرض فعال است)

Contextual Cleaner | پالایشگر زمینه‌ای

English

Remove contextual patterns and special combinations from text. The contextual cleaner identifies and removes Persian name patterns, common word combinations, and other context-specific patterns that should not be flagged as foreign words.

Methods

contextual_cleaner.clean_text(text: str) → str

Remove contextual patterns from text.

Parameters: text -- Input text to clean
Returns: Cleaned text

Patterns removed:

Persian full names (first name + last name)
Common Persian compound words
Persian idioms and expressions
Proper nouns with Persian markers
Date and time expressions
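One simple way to picture this step is as a masking pass that removes known phrases before the detector runs. The sketch below is purely illustrative: the phrase list and the mask_protected helper are hypothetical, and Pasban's real pattern set is not shown here and is likely richer than literal phrase matching.

```python
import re

# Hypothetical protected phrases; the library's real patterns are not shown here.
PROTECTED_PHRASES = ["حافظ شیرازی", "کتابخانهٔ ملی"]


def mask_protected(text, phrases=PROTECTED_PHRASES):
    """Remove each protected phrase, then tidy the leftover whitespace."""
    # Longest-first so a longer phrase is removed before any phrase it contains.
    for phrase in sorted(phrases, key=len, reverse=True):
        text = text.replace(phrase, " ")
    return re.sub(r"\s+", " ", text).strip()


print(mask_protected("حافظ شیرازی شاعر نامدار است."))
```

Whatever remains after masking is what the detector scans, so words inside a protected phrase can no longer be reported as foreign.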

Example Usage

from pasban.normalizer.contextual_remover import contextual_cleaner

# Remove name patterns
text = "حافظ شیرازی شاعر نامدار است."
cleaned = contextual_cleaner.clean_text(text)
print(cleaned)
# Names like "حافظ شیرازی" are removed from detection

# Remove compound words
text = "کتابخانهٔ ملی ایران"
cleaned = contextual_cleaner.clean_text(text)
print(cleaned)

# Complex example with multiple patterns
text = """
رومی مولانا جلال‌الدین محمد بلخی شاعر و عارف بزرگ ایرانی
در قرن هفتم هجری در شهر بلخ متولد شد.
"""
cleaned = contextual_cleaner.clean_text(text)
print(cleaned)

Integration with Detector

from pasban.detector import WordDetector

detector = WordDetector()

text = "حافظ شیرازی و مولانا رومی شاعران بزرگ ایرانی هستند."

# With contextual cleaning (default)
result = detector.detect(text, contextual=True)
print(f"With cleaning: {result.count} foreign words")

# Without contextual cleaning
result = detector.detect(text, contextual=False)
print(f"Without cleaning: {result.count} foreign words")

When to use contextual cleaning

When processing literary or historical texts
When text contains many Persian names
When working with formal Persian writing
Recommended for most use cases (enabled by default)
Disable for maximum detection sensitivity

پارسی

زدودن ساختارهای زمینه‌ای و ترکیب‌های خاص از متن. پالایشگر زمینه‌ای الگوهای نام پارسی، ترکیب‌های رایج واژگان و دیگر الگوهای وابسته به زمینه را که نباید به عنوان واژه بیگانه شناسایی شوند، می‌شناسد و حذف می‌کند.

متدها

contextual_cleaner.clean_text(text: str) → str

زدودن الگوهای زمینه‌ای از متن.

پارامترها:: text -- متن ورودی برای پالایش
بازگشت ها:: متن پالایش‌شده

الگوهای حذف‌شده:

نام‌های کامل پارسی (نام + نام خانوادگی)
واژگان مرکب رایج پارسی
اصطلاحات و عبارات پارسی
اسامی خاص با نشانگرهای پارسی
عبارات تاریخ و زمان

نمونه کاربرد

from pasban.normalizer.contextual_remover import contextual_cleaner

# حذف الگوهای نام
text = "حافظ شیرازی شاعر نامدار است."
cleaned = contextual_cleaner.clean_text(text)
print(cleaned)
# نام‌هایی مانند "حافظ شیرازی" از شناسایی حذف می‌شوند

# حذف واژگان مرکب
text = "کتابخانهٔ ملی ایران"
cleaned = contextual_cleaner.clean_text(text)
print(cleaned)

# نمونهٔ پیچیده با الگوهای چندگانه
text = """
رومی مولانا جلال‌الدین محمد بلخی شاعر و عارف بزرگ ایرانی
در قرن هفتم هجری در شهر بلخ متولد شد.
"""
cleaned = contextual_cleaner.clean_text(text)
print(cleaned)

یکپارچگی با شناساگر

from pasban.detector import WordDetector

detector = WordDetector()

text = "حافظ شیرازی و مولانا رومی شاعران بزرگ ایرانی هستند."

# با پالایش زمینه‌ای (پیش‌فرض)
result = detector.detect(text, contextual=True)
print(f"با پالایش: {result.count} واژه بیگانه")

# بدون پالایش زمینه‌ای
result = detector.detect(text, contextual=False)
print(f"بدون پالایش: {result.count} واژه بیگانه")

چه زمانی از پالایش زمینه‌ای استفاده کنیم

هنگام پردازش متن‌های ادبی یا تاریخی
زمانی که متن شامل نام‌های پارسی زیادی است
هنگام کار با نوشتار رسمی پارسی
برای اکثر کاربردها توصیه می‌شود (پیش‌فرض فعال است)
برای حساسیت بیشینه شناسایی، غیرفعال کنید

Core Data Types | انواع داده پایه

DetectData

English

Container for detection results and statistics. This object provides comprehensive information about detected foreign words and text statistics.

Attributes

foreign_words: list[str]: List of detected foreign words (may contain duplicates for multiple occurrences)

words: dict[str, str]: Mapping of unique foreign words to their Persian equivalents

text: str: Original or processed input text

count: int: Total number of detected foreign word occurrences

unique_count: int: Number of unique detected foreign words

total_words: int: Total number of words in the text

foreign_percentage: float: Percentage of foreign words in the text

to_text: str: Detailed Persian text report

to_summary_text: str: Concise Persian summary report
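The relationship between the counting attributes can be sketched with a small dataclass. This is a hypothetical stand-in, not the DetectData class itself: DetectDataSketch is an invented name, and the percentage formula (occurrences divided by total words) is an assumption that happens to agree with the example figures in this document.

```python
from dataclasses import dataclass, field


@dataclass
class DetectDataSketch:
    """Hypothetical stand-in; field names follow the attribute list above."""
    foreign_words: list = field(default_factory=list)  # one entry per occurrence
    words: dict = field(default_factory=dict)           # unique word -> equivalent
    total_words: int = 0

    @property
    def count(self) -> int:
        return len(self.foreign_words)

    @property
    def unique_count(self) -> int:
        return len(self.words)

    @property
    def foreign_percentage(self) -> float:
        # Assumed definition: occurrences / total words * 100.
        if self.total_words == 0:
            return 0.0
        return 100.0 * self.count / self.total_words


data = DetectDataSketch(["a", "b", "a"], {"a": "x", "b": "y"}, total_words=9)
print(data.count, data.unique_count, f"{data.foreign_percentage:.2f}%")
# 3 2 33.33%
```

A count larger than unique_count, as in this example, signals that at least one foreign word is repeated.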

Example Usage

from pasban.detector import WordDetector

detector = WordDetector()
text = "این متن شامل کامپیوتر و اینترنت و سیستم است."
result = detector.detect(text)

# Access basic information
print(f"Foreign words list: {result.foreign_words}")
# Output: ['کامپیوتر', 'اینترنت', 'سیستم']

print(f"Word mappings: {result.words}")
# Output: {'کامپیوتر': 'رایانه', 'اینترنت': 'اینترنت', 'سیستم': 'سامانه'}

# Access statistics
print(f"Total occurrences: {result.count}")
# Output: 3

print(f"Unique words: {result.unique_count}")
# Output: 3

print(f"Total words in text: {result.total_words}")
# Output: 9

print(f"Foreign percentage: {result.foreign_percentage:.2f}%")
# Output: 33.33%

# Get reports
print("Detailed report:")
print(result.to_text)
# Output: واژگان بیگانه یافت‌شده: کامپیوتر (رایانه), اینترنت (اینترنت), ...

print("\nSummary:")
print(result.to_summary_text)
# Output: 3 واژه بیگانه از 9 واژه (33.33٪)

Using DetectData effectively

Use foreign_words for processing individual occurrences
Use words for unique word mappings
Use count vs unique_count to detect repetition
Use foreign_percentage for quality metrics
Use to_text for user-facing reports
Use to_summary_text for dashboards/statistics

پارسی

ظرف نتایج شناسایی و آمارها. این شیء اطلاعات جامعی دربارهٔ واژگان بیگانه شناسایی‌شده و آمار متن فراهم می‌کند.

ویژگی‌ها

foreign_words

Type:: list[str]

فهرست واژگان بیگانه شناسایی‌شده (ممکن است برای رخدادهای چندگانه تکراری باشد)

words

Type:: dict[str, str]

نگاشت واژگان بیگانه یکتا به برابرهای پارسی آن‌ها

text: str: متن ورودی اصلی یا پردازش‌شده

count: int: تعداد کل رخدادهای واژگان بیگانه

unique_count: int: تعداد واژگان بیگانه یکتا

total_words: int: تعداد کل واژگان در متن

foreign_percentage: float: درصد واژگان بیگانه در متن

to_text: str: گزارش متنی تفصیلی به پارسی

to_summary_text: str: گزارش خلاصه به پارسی

نمونه کاربرد

from pasban.detector import WordDetector

detector = WordDetector()
text = "این متن شامل کامپیوتر و اینترنت و سیستم است."
result = detector.detect(text)

# دسترسی به اطلاعات پایه
print(f"فهرست واژگان بیگانه: {result.foreign_words}")
# خروجی: ['کامپیوتر', 'اینترنت', 'سیستم']

print(f"نگاشت واژگان: {result.words}")
# خروجی: {'کامپیوتر': 'رایانه', 'اینترنت': 'اینترنت', 'سیستم': 'سامانه'}

# دسترسی به آمارها
print(f"تعداد کل رخدادها: {result.count}")
# خروجی: 3

print(f"واژگان یکتا: {result.unique_count}")
# خروجی: 3

print(f"کل واژگان متن: {result.total_words}")
# خروجی: 9

print(f"درصد واژگان بیگانه: {result.foreign_percentage:.2f}%")
# خروجی: 33.33%

# دریافت گزارش‌ها
print("گزارش تفصیلی:")
print(result.to_text)
# خروجی: واژگان بیگانه یافت‌شده: کامپیوتر (رایانه), اینترنت (اینترنت), ...

print("\nخلاصه:")
print(result.to_summary_text)
# خروجی: 3 واژه بیگانه از 9 واژه (33.33٪)

استفاده مؤثر از DetectData

از foreign_words برای پردازش رخدادهای جداگانه استفاده کنید
از words برای نگاشت واژگان یکتا استفاده کنید
از count در مقابل unique_count برای تشخیص تکرار استفاده کنید
از foreign_percentage برای معیارهای کیفیت استفاده کنید
از to_text برای گزارش‌های کاربری استفاده کنید
از to_summary_text برای داشبوردها/آمارها استفاده کنید

AhoCorasickAutomaton

English

Multi-pattern string matching engine using the Aho-Corasick algorithm. This is the core algorithm used by WordDetector for fast detection of multiple patterns simultaneously.

Methods

add_word(word: str) → None

Add a word to the automaton trie.

Parameters: word -- Word to add

build_failure_links() → None

Build failure links for fast matching. Must be called after adding all words and before searching.

search(text: str) → List[Tuple[str, int, int]]

Find all pattern matches in the text.

Parameters: text -- Text to search in
Returns: List of tuples (matched_word, start_position, end_position)

Example Usage

from pasban.core import AhoCorasickAutomaton

# Create automaton
ac = AhoCorasickAutomaton()

# Add patterns
ac.add_word("کامپیوتر")
ac.add_word("اینترنت")
ac.add_word("سیستم")

# Build failure links (required before searching)
ac.build_failure_links()

# Search for patterns
text = "این کامپیوتر به اینترنت متصل است و سیستم عامل دارد."
matches = ac.search(text)

# Process results
for word, start, end in matches:
    print(f"Found '{word}' at position {start}-{end}")
# Output:
# Found 'کامپیوتر' at position 4-12
# Found 'اینترنت' at position 16-23
# Found 'سیستم' at position 35-40

Advanced Usage

from pasban.core import AhoCorasickAutomaton

# Build automaton from word list
words = ["رایانه", "اینترنت", "شبکه", "سامانه"]
ac = AhoCorasickAutomaton()

for word in words:
    ac.add_word(word)

ac.build_failure_links()

# Search and extract context
text = "رایانه به شبکه اینترنت متصل است."
matches = ac.search(text)

for word, start, end in matches:
    # Extract context (10 chars before and after)
    context_start = max(0, start - 10)
    context_end = min(len(text), end + 10)
    context = text[context_start:context_end]
    print(f"{word}: ...{context}...")
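
Aho-Corasick reports every substring occurrence, so a pattern can also match inside a longer word. The following is a minimal sketch of a whole-word filter over (word, start, end) tuples. It assumes end positions are exclusive, as the example output above suggests, and treats the zero-width non-joiner as part of a word. The helper names are hypothetical, and this is not pasban's contextual cleaner.

```python
ZWNJ = "\u200c"  # zero-width non-joiner, used inside Persian compound words

def is_word_char(ch):
    """Treat letters, digits, and ZWNJ as belonging to a word."""
    return ch.isalnum() or ch == ZWNJ

def whole_word_matches(text, matches):
    """Keep only matches that are not embedded in a longer word.

    `matches` holds (word, start, end) tuples with an exclusive end.
    """
    kept = []
    for word, start, end in matches:
        before_ok = start == 0 or not is_word_char(text[start - 1])
        after_ok = end == len(text) or not is_word_char(text[end])
        if before_ok and after_ok:
            kept.append((word, start, end))
    return kept

# The second occurrence is followed by ZWNJ + plural suffix, so it is dropped.
text = "شبکه و شبکه" + ZWNJ + "ها"
matches = [("شبکه", 0, 4), ("شبکه", 7, 11)]
print(whole_word_matches(text, matches))
# → [('شبکه', 0, 4)]
```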

Performance characteristics

Time complexity: O(n + m + z), where n = text length, m = total length of all patterns, and z = number of matches reported
Space complexity: O(m) for the trie structure
Initialization: O(m) to build the trie and failure links
Optimal for: Multiple patterns, large texts, repeated searches
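
To make these figures concrete, here is a minimal, self-contained sketch of the general Aho-Corasick technique. It is not pasban's implementation: the class name TinyAhoCorasick and its internal layout are assumptions, and only the three method names mirror the API documented above. Building the trie and failure links is a one-time cost proportional to the total pattern length, and search scans the text once, following failure links on a mismatch. End positions are exclusive here.

```python
from collections import deque

class TinyAhoCorasick:
    """Illustrative Aho-Corasick automaton; not pasban's implementation."""

    def __init__(self):
        # Parallel lists indexed by node id; node 0 is the root.
        self.children = [{}]   # char -> child node id
        self.fail = [0]        # failure link per node
        self.outputs = [[]]    # patterns ending at each node

    def add_word(self, word):
        node = 0
        for ch in word:
            if ch not in self.children[node]:
                self.children.append({})
                self.fail.append(0)
                self.outputs.append([])
                self.children[node][ch] = len(self.children) - 1
            node = self.children[node][ch]
        self.outputs[node].append(word)

    def build_failure_links(self):
        # Call once, after all words are added. Depth-1 nodes fail to the root.
        queue = deque(self.children[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.children[node].items():
                # Walk failure links until some node has an edge for `ch`.
                f = self.fail[node]
                while f and ch not in self.children[f]:
                    f = self.fail[f]
                self.fail[child] = self.children[f].get(ch, 0)
                # Inherit patterns that end at the failure target (suffixes).
                self.outputs[child] += self.outputs[self.fail[child]]
                queue.append(child)

    def search(self, text):
        matches = []
        node = 0
        for i, ch in enumerate(text):
            while node and ch not in self.children[node]:
                node = self.fail[node]
            node = self.children[node].get(ch, 0)
            for word in self.outputs[node]:
                matches.append((word, i + 1 - len(word), i + 1))
        return matches

ac = TinyAhoCorasick()
for pattern in ["he", "she", "his", "hers"]:
    ac.add_word(pattern)
ac.build_failure_links()
print(ac.search("ushers"))
# → [('she', 1, 4), ('he', 2, 4), ('hers', 2, 6)]
```

The overlapping results for "she" and "he" come from the inherited outputs: a single pass over the text reports every pattern that ends at each position.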

پارسی

موتور جستجوی چندالگویی با استفاده از الگوریتم آهو-کُراسیک. این الگوریتم هستهٔ اصلی است که WordDetector برای شناسایی سریع چندین الگو به‌طور همزمان از آن استفاده می‌کند.

متدها

add_word(word: str) → None

افزودن واژه به درخت خودکار.

پارامترها: word -- واژه برای افزودن

build_failure_links() → None

ساخت پیوندهای شکست برای تطبیق سریع. باید پس از افزودن همهٔ واژگان و پیش از جستجو فراخوانی شود.

search(text: str) → List[Tuple[str, int, int]]

یافتن همهٔ تطبیق‌های الگو در متن.

پارامترها: text -- متن برای جستجو
بازگشت‌ها: فهرست تاپل‌ها (واژه_تطبیق‌یافته، موقعیت_آغاز، موقعیت_پایان)

نمونه کاربرد

from pasban.core import AhoCorasickAutomaton

# ساخت خودکار
ac = AhoCorasickAutomaton()

# افزودن الگوها
ac.add_word("کامپیوتر")
ac.add_word("اینترنت")
ac.add_word("سیستم")

# ساخت پیوندهای شکست (لازم پیش از جستجو)
ac.build_failure_links()

# جستجوی الگوها
text = "این کامپیوتر به اینترنت متصل است و سیستم عامل دارد."
matches = ac.search(text)

# پردازش نتایج
for word, start, end in matches:
    print(f"'{word}' در موقعیت {start}-{end} یافت شد")
# خروجی:
# 'کامپیوتر' در موقعیت 4-12 یافت شد
# 'اینترنت' در موقعیت 16-23 یافت شد
# 'سیستم' در موقعیت 35-40 یافت شد

کاربرد پیشرفته

from pasban.core import AhoCorasickAutomaton

# ساخت خودکار از فهرست واژگان
words = ["رایانه", "اینترنت", "شبکه", "سامانه"]
ac = AhoCorasickAutomaton()

for word in words:
    ac.add_word(word)

ac.build_failure_links()

# جستجو و استخراج زمینه
text = "رایانه به شبکه اینترنت متصل است."
matches = ac.search(text)

for word, start, end in matches:
    # استخراج زمینه (۱۰ نویسه پیش و پس)
    context_start = max(0, start - 10)
    context_end = min(len(text), end + 10)
    context = text[context_start:context_end]
    print(f"{word}: ...{context}...")

ویژگی‌های کارایی

پیچیدگی زمانی: O(n + m + z) که n = طول متن، m = مجموع طول همهٔ الگوها و z = تعداد تطبیق‌های یافت‌شده است
پیچیدگی فضایی: O(m) برای ساختار درخت
راه‌اندازی: O(m) برای ساخت درخت و پیوندهای شکست
بهینه برای: الگوهای چندگانه، متن‌های بزرگ، جستجوهای تکراری
