The Ebla Digital Archives searching tool, developed as part of the Progetto Sinleqiunnini, has been conceived to support regular expressions, a concise and flexible instrument for identifying strings of text, such as particular characters, words, or patterns of characters.
Search string and Co-occurrences are search interfaces that work at the “word” level. They rely on tokenization, a process that splits a stream of text into useful semantic units (tokens).
Exact match of a word:
Enclose the word between the boundary markers ^ and $, and tick "Advanced Search", in order to match individual words (e.g. ^dingir=-ku-ra$).
Searching words with pre/postposed determinatives:
Preposed determinatives must be followed by =- (e.g. dingir=-ʾà-da; giš=-geštin).
Postposed determinatives must be preceded by -= (e.g. ib-la-=ki; ga-raš-=sar; ʾà-da-um-=túg).
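The conventions above can be sketched as a few helper functions that assemble query strings. This is only an illustration: the function names are hypothetical, the token list is a stand-in built from the examples above, and Python's `re` module is used in place of the archive's own engine, on the assumption that indexed tokens carry the =- and -= markers exactly as described.

```python
import re

def with_preposed(determinative: str, word: str) -> str:
    """Join a preposed determinative to a word with the '=-' marker."""
    return f"{determinative}=-{word}"

def with_postposed(word: str, determinative: str) -> str:
    """Join a postposed determinative to a word with the '-=' marker."""
    return f"{word}-={determinative}"

def exact(word: str) -> str:
    """Wrap a word in ^ and $ so that only the whole token matches."""
    return f"^{word}$"

# Stand-in token list taken from the examples above (not real index data).
tokens = ["dingir=-ku-ra", "dingir=-ʾà-da", "giš=-geštin", "ib-la-=ki"]

exact_pattern = exact(with_postposed("ib-la", "ki"))        # '^ib-la-=ki$'
exact_hits = [t for t in tokens if re.search(exact_pattern, t)]

partial_hits = [t for t in tokens if re.search("dingir=-", t)]

print(exact_pattern, exact_hits)
print(partial_hits)
```

Without the ^ and $ markers the pattern matches any token that contains it, which is why the second search returns both words written with the dingir determinative.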
1) With Search string one can search either for a whole “word” (e.g. gú-da-da-númki) or for a portion of it (e.g. da-núm). Requests composed of two or more tokens, such as ì-giš-sag en, will produce unreliable results or error messages. Contextual searches are nevertheless possible: tick word range and select the related options.
2) The Co-occurrences engine comes with a less flexible set of options, but it offers a tool to match the occurrences of two or more related strings within a given context. Its output lists only those texts in which all the elements of the user's request actually appear.
3) Glossary works at a different linguistic level and has not been specifically designed for the Ebla text corpus. Basically, it is a tool for information retrieval applied to lemmatized and translated corpora. The input required is the starting value (or characters) of the expected string. It processes the result set by grouping the retrieved data on the basis of their similarity. In the case of the Ebla corpus, it can be employed to compare sign sequences in a concise synoptic format and to investigate uncertain sign values by comparing pertinent variants. For instance, by entering “en-na-”, it will provide occurrences of en-na-NI, en-na-NI-NI, en-na-NI|NI, en-na-ni-il, en-na-il.
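The behaviour of the Co-occurrences and Glossary tools can be approximated with the following sketch. It is not the archive's implementation: the function names, the text identifiers and the miniature corpus are invented for illustration, plain substring matching is assumed, and the glossary grouping is reduced to counting identical variants.

```python
from collections import Counter

def cooccurrences(texts: dict[str, list[str]], wanted: list[str]) -> list[str]:
    """Return the ids of texts in which every requested string occurs
    inside at least one token."""
    return [
        text_id
        for text_id, tokens in texts.items()
        if all(any(w in tok for tok in tokens) for w in wanted)
    ]

def glossary(tokens: list[str], prefix: str) -> Counter:
    """Collect the tokens that begin with the prefix and count each variant."""
    return Counter(tok for tok in tokens if tok.startswith(prefix))

# Made-up miniature corpus; ids and contents are illustrative only.
texts = {
    "text A": ["en-na-il", "en", "ib-la-=ki"],
    "text B": ["en-na-ni-il", "ì-giš-sag"],
    "text C": ["en-na-il", "ì-giš-sag", "en"],
}

found = cooccurrences(texts, ["ì-giš-sag", "en-na-"])
all_tokens = [tok for toks in texts.values() for tok in toks]
variants = glossary(all_tokens, "en-na-")

print(found)
print(variants)
```

Only the texts that contain every requested element are listed, while the glossary view sets the variants that share the entered beginning side by side.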
For more advanced searching procedures, please refer to the following paragraphs as well as to .Text encoding.
Users can choose between a simplified searching method, Plain Text, and a more advanced pattern matching, Advanced Search (Regexp). Only the latter uses the extended functionalities of regular expressions to support pattern-matching operations, whereas the former has been deliberately simplified for basic use.
This section summarizes, with examples, some of the main features, special characters and constructs that can be used with the Advanced Search tool. For further details please refer to MySQL Reference Manual: 11.4.2. Regular Expressions and Wikipedia: Regular expression.
A regular expression, often called a pattern, is an expression that describes a set of strings. Patterns are usually adopted to give a concise description of a set without the need to list all its elements. For example, the two strings ne-zi-mu and ni-zi-mu can be matched by the pattern " n(e|i)-zi-mu ".
The following operations help to construct regular expressions:
A vertical bar, |, separates alternatives, as in the example above n(e|i)-zi-mu.
Parentheses are used to define the scope and precedence of the operators (among other uses).
For example, ne-zi-mu|ni-zi-mu and n(e|i)-zi-mu are equivalent patterns which both describe the set of ne-zi-mu and ni-zi-mu.
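The equivalence of the two patterns can be checked with a short test. Python's `re` module is used here as a stand-in for the Advanced Search engine, and the third form, na-zi-mu, is an invented control that neither pattern should accept.

```python
import re

long_form = re.compile(r"ne-zi-mu|ni-zi-mu")
short_form = re.compile(r"n(e|i)-zi-mu")

# The last form is a made-up control string.
forms = ["ne-zi-mu", "ni-zi-mu", "na-zi-mu"]

results = {
    form: (bool(long_form.fullmatch(form)), bool(short_form.fullmatch(form)))
    for form in forms
}

for form, (by_long, by_short) in results.items():
    print(form, by_long, by_short)
```

Both patterns accept the first two forms and reject the third, so the shorter notation with parentheses loses nothing.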
A quantifier after a token (such as a character) or group specifies how often that preceding element is allowed to occur. The most common quantifiers are the question mark ?, the asterisk *, and the plus sign +.
|?||The question mark indicates zero or one occurrence of the preceding element. For instance, the pattern kum? matches both "ku" and "kum".|
|*||The asterisk indicates zero or more occurrences of the preceding element.|
|+||The plus sign indicates one or more occurrences of the preceding element.|
|[ ]||A bracket expression. It matches a single character or characters that are contained within the brackets. For example, [abc] matches a, b, or c. [a-z] specifies a range which matches any lowercase letter from a to z (only ASCII graphemes). These forms can be mixed: [abcx-z] matches a, b, c, x, y, and z, as does [a-cx-z].|
|[^ ]||Matches a single character that is not contained within the brackets. For example, [^abc] matches any character other than a, b, or c.|
|^||Matches the starting position within the string. For example, the pattern ^ga matches the string "ga-zi", but it does not match "zi-ga".|
|$||Matches the ending position of the string. For example, the pattern ga$ matches strings like "zi-ga", but not "ga-zi".|
All these constructions can be freely combined to form arbitrarily complex expressions, much like one can construct arithmetical expressions from numbers and the operations +, −, ×, and ÷.
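The constructs listed above can be tried out with a small helper. The sketch uses Python's `re` module, which treats these particular operators in the same way; the helper name `matches` is hypothetical, and the candidate strings beyond those quoted in the table are invented test values.

```python
import re

def matches(pattern: str, candidates: list[str]) -> list[str]:
    """Return the candidates in which the pattern is found."""
    return [c for c in candidates if re.search(pattern, c)]

# Quantifiers (anchored with ^ and $ so that the whole string is tested).
print(matches(r"^kum?$", ["ku", "kum", "kumm"]))   # zero or one 'm'
print(matches(r"^ba*$", ["b", "ba", "baa"]))       # zero or more 'a'
print(matches(r"^ba+$", ["b", "ba", "baa"]))       # one or more 'a'

# Bracket expressions.
print(matches(r"^[abc]$", ["a", "b", "d"]))
print(matches(r"^[^abc]$", ["a", "d"]))

# Anchors.
print(matches(r"^ga", ["ga-zi", "zi-ga"]))
print(matches(r"ga$", ["ga-zi", "zi-ga"]))
```

Note that an unanchored kum? would also be found inside a longer string such as "kumm", because the pattern may match any portion of it.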
4) Character classes.
The Sinleqiunnini searching tool uses the following classes or categories of characters:
|[:blank:]||[ \t]||Space and tab|
|[:digit:]||[0-9]||Digits (NB.: subscript numbers are special characters)|
|[:space:]||[ \t\r\n\v\f]||Whitespace characters|
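The following sketch shows what these classes cover by means of their explicit bracket equivalents. Python's `re` module does not implement the [:name:] notation, so the equivalents from the middle column are written out; the constant names and the sample line (assembled from sign values quoted in this guide) are assumptions made for illustration.

```python
import re

# Explicit equivalents of the classes listed above.
BLANK = r"[ \t]"
DIGIT = r"[0-9]"
SPACE = r"[ \t\r\n\v\f]"

# Made-up sample line: an ASCII digit, a space, a tab and two subscripted values.
line = "2 sal₄\tDU₁₀"

digits = re.findall(DIGIT, line)
fields = re.split(BLANK, line)
chunks = re.split(SPACE + "+", line + "\n")

print(digits)   # only the ASCII digit is found
print(fields)
print(chunks)
```

The subscript numbers in sal₄ and DU₁₀ are not returned by the digit class, in line with the remark in the table that they count as special characters.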
Unicode is a character set that aims at defining all characters and glyphs from all human scripts, but unfortunately it brings its own requirements and pitfalls when it comes to regular expressions (for an exhaustive overview, please refer to Unicode Regular Expressions). A few examples will suffice here to introduce the reader to some of the basic problems.
All the texts of the Ebla Digital Archives are encoded in UTF-8 (Unicode), and some of their characters (glyphs) lie outside the strict ASCII range. As a consequence, classes and quantifiers might not work as expected. For instance, a search for any "alphabetic character" ( [A-Za-z] or [:alpha:] ) omits characters such as š or á from the result set. Classes and quantifiers only match patterns belonging to the visible set of ASCII, so it is necessary to modify the query by adding the characters needed: e.g. [a-z] => [a-zšá].
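The effect of an ASCII-only range can be reproduced with a short sketch. Python's `re` module is used as a stand-in; it is Unicode-aware, so it reproduces only the limitation of the [a-z] range and not any byte-level behaviour of the archive's engine. The normalization helper `nfc` is an added precaution that is not part of the tool, and the extended class is just one possible selection of the glyphs listed in this guide.

```python
import re
import unicodedata

def nfc(text: str) -> str:
    """Normalize to composed form so that a glyph such as š is one code point."""
    return unicodedata.normalize("NFC", text)

ASCII_ONLY = r"[a-z]+"
EXTENDED = nfc(r"[a-zšáàéèíìúùʾ]+")

word = nfc("geštin")

ascii_runs = re.findall(ASCII_ONLY, word)
extended_runs = re.findall(EXTENDED, word)

print(ascii_runs)      # the word is broken at š
print(extended_runs)   # the whole word is matched
```

With the plain range the word is split into the fragments on either side of š; once the extra glyphs are added to the bracket expression it is matched as a whole.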
The following synopsis displays Unicode characters employed in the Ebla Digital Archives:
|1.||⸢ ⸣||half brackets|
|2.||×||inclusion (e.g. KA×ME)|
|3.||₀ ₁ ₂ ₃ ₄ ₅ ₆ ₇ ₈ ₉ ᵪ||subscripts (e.g. ru₁₂, DU₁₀, sal₄, but RÚ)|
|4.||ʾ á à Á À é è É È í ì Í Ì ú ù Ú Ù š Š||simple glyphs|
|5.||Ⅰ Ⅱ Ⅲ Ⅳ Ⅴ Ⅵ Ⅶ Ⅷ Ⅸ Ⅹ||Roman numerals, used for items composed of multiple parts (e.g. íb-III "triple belt")|
When using the advanced pattern matching [Advanced Search], some metacharacters of the regular expression engine may coincide with characters used in the transliterated texts. For instance, the ligature sign + (e.g. ŠE+GÌR or 20+1/2) will be interpreted as one or more occurrences of the preceding sign (sub B.2.3). In these cases, it suffices to "escape" the character: a double backslash (\\) must precede any metacharacter that is meant literally (e.g. \\+ => ŠE\\+GÌR).
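A query string with the required escapes can be prepared automatically, as in the sketch below. The function name and the list of metacharacters are assumptions made for illustration and are not part of the tool; the double backslash follows the rule stated above, whereas Python's own `re` module needs only a single backslash for the same effect.

```python
import re

# Characters treated as metacharacters in this sketch (an assumed list).
META = set(".^$*+?()[]{}|")

def escape_for_advanced_search(text: str) -> str:
    r"""Prefix every metacharacter with a double backslash,
    e.g. ŠE+GÌR becomes ŠE\\+GÌR."""
    return "".join("\\\\" + ch if ch in META else ch for ch in text)

query = escape_for_advanced_search("ŠE+GÌR")
print(query)

# Comparison in Python, where one backslash suffices.
escaped_hit = bool(re.search(r"ŠE\+GÌR", "ŠE+GÌR"))
unescaped_hit = bool(re.search(r"ŠE+GÌR", "ŠE+GÌR"))
print(escaped_hit, unescaped_hit)
```

The unescaped pattern fails on the very string it was copied from, because the + is read as a quantifier applied to the preceding E rather than as the ligature sign.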
- .Text encoding
- .How to
- .Escape sequence