CollateX: Normalize TXT
Feature Requests
Description
Some manuscripts provide their text with diacritics while others don't. As a result, CollateX treats words that are identical except for their diacritics as different and reports false positives. To avoid this, we have to strip all diacritics from the text before we save the TXT files.
Unicode character groups to consider:
- "Tashkil from ISO 8859-6", "Combining maddah and hamza", "Other combining marks", see https://unicode-table.com/en/blocks/arabic/ for the Arabic script
- "Syriac punctuation and signs", "Syriac points (vowels)", "Syriac marks", see https://unicode-table.com/en/blocks/syrian/ for the Syriac script
This list may be extended.
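A minimal sketch of how the stripping step could look in Python, to make the request concrete. The function name `strip_diacritics` and the table `STRIP_RANGES` are hypothetical, and the code-point ranges are my reading of the Unicode charts for the groups listed above; they should be verified against the charts before use.

```python
# Inclusive code-point ranges to delete. These boundaries are an assumption
# derived from the Unicode charts and may need adjusting or extending.
STRIP_RANGES = [
    (0x064B, 0x065F),  # Arabic: tashkil, combining maddah and hamza, other combining marks
    (0x0700, 0x070D),  # Syriac punctuation and signs
    (0x0730, 0x074A),  # Syriac points (vowels) and Syriac marks
]

# Translation table for str.translate: mapping a code point to None deletes it.
_DELETE_TABLE = {
    code_point: None
    for low, high in STRIP_RANGES
    for code_point in range(low, high + 1)
}


def strip_diacritics(text: str) -> str:
    """Return text with every character in STRIP_RANGES removed."""
    return text.translate(_DELETE_TABLE)


if __name__ == "__main__":
    vocalized = "\u0643\u064e\u062a\u064e\u0628\u064e"  # kaf, ta, ba, each with fatha
    print(strip_diacritics(vocalized) == "\u0643\u062a\u0628")
```

Two caveats apply to this sketch. Punctuation is deleted rather than replaced by a space, so words separated only by a punctuation sign would be joined. Precomposed letters such as U+0622 (alef with madda above) are left untouched unless the text is first decomposed, for example with `unicodedata.normalize("NFD", text)`.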
User Stories
As a scholar I need normalized text in order to get good results with CollateX. Using the texts as they are produces too many errors.
Classification
Is this feature an enhancement of existing code or a completely new feature?
- enhancement
- new feature