In this blog, I’ll discuss how to create custom dictionaries in SAP HANA. Some use cases are not covered by the standard configurations, so customers have to create their own dictionaries for performing Text Analysis.
Use Case: A company has a number of products in its product portfolio that are not completely covered by the standard configuration. In such a case, the company can create a custom dictionary with an entity category Product and add all the product names in the portfolio.
SAP HANA is shipped with several predefined, standard text analysis configurations. These configurations are available in the “sap.hana.ta.config” repository package, as shown in Figure 1. For more details on the standard Text Analysis Configurations, refer to the blog [https://blogs.sap.com/2018/02/01/sap-hana-text-analysis-3/].
Figure 1: Standard Text Analysis Configurations
Steps involved in the implementation of Custom dictionaries
- Create a Custom dictionary
- Update the Configuration File or create a Custom Configuration File specifying Custom Dictionary
***********************************************************************************
Step 1: Create Custom Dictionary
A dictionary contains a number of user-defined entity types, each of which contains any number of entities in standard and variant forms. In simple terms, a dictionary stores name variations in a structured manner so that they are accessible to the extraction process. Dictionaries are language-independent and can be created for all 34 supported languages.
Dictionary files must be in XML format and follow the syntax below:
<?xml version="1.0" encoding="UTF-8" ?>
<dictionary xmlns="http://www.sap.com/ta/4.0">
  <entity_category name="Category_Name">
    <entity_name standard_form="Entity_Name">
      <variant name="Variant_Name"/>
    </entity_name>
  </entity_category>
  <!-- ... further entity_category elements ... -->
</dictionary>
Three parameters need to be specified when creating the dictionary:
- Category name: the entity type under which the entities are grouped
- Standard form of an entity: the complete or precise form of a given entity
- Variant names for an entity: less standard forms of a given entity
Figure 2 below shows the custom dictionary created for performing Custom Text Analysis.
Figure 2: Custom Dictionary File
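To make the syntax concrete, here is a minimal sketch of a dictionary for the product use case described at the start of this blog. The category name PRODUCT and the product entries are assumptions chosen for illustration; they are not the contents of the dictionary shown in Figure 2.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<dictionary xmlns="http://www.sap.com/ta/4.0">
  <!-- Hypothetical category and entries, for illustration only -->
  <entity_category name="PRODUCT">
    <!-- standard_form is the precise name; variants are looser spellings -->
    <entity_name standard_form="SAP HANA">
      <variant name="HANA"/>
      <variant name="HANA DB"/>
    </entity_name>
    <entity_name standard_form="SAP Business Warehouse">
      <variant name="SAP BW"/>
      <variant name="BW"/>
    </entity_name>
  </entity_category>
</dictionary>
```

With a dictionary like this, a mention of "SAP BW" in the analyzed text would be expected to be extracted as an entity of type PRODUCT and related back to its standard form.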
***********************************************************************************
Step 2: Update the Configuration File or create a Custom Configuration File specifying Custom Dictionary
Custom Text Analysis Configurations can be used to perform custom text analysis using custom text analysis dictionaries and extraction rule sets. Create your own custom text analysis configuration files with the “.hdbtextconfig” file extension. Configuration files are also in XML format.
The snippet below shows the sequence of text analysis steps in XML format.
<configuration name="…AggregateAnalyzer.Aggregator">
  <property name="Analyzers" type="string-list">
    <string-list-value>…FormatConversionAnalyzer.FC</string-list-value>
    <string-list-value>…StructureAnalyzer.SA</string-list-value>
    <string-list-value>…LinguisticAnalyzer.LX</string-list-value>
    <string-list-value>…ExtractionAnalyzer.TF</string-list-value>
    <string-list-value>….GrammaticalRoleAnalyzer.GRA</string-list-value>
  </property>
</configuration>
In this configuration section, the following analyzers are available:
- “FormatConversionAnalyzer” performs document conversion
- “StructureAnalyzer” performs de-tagging: mark-up removal, whitespace normalization and language detection
- “LinguisticAnalyzer” performs linguistic analysis, which includes tokenization, identification of word base forms (stems) and part-of-speech tagging
- “ExtractionAnalyzer” is optional and is used for entity and relation extraction
- “GrammaticalRoleAnalyzer” is also optional and is used to identify functional relationships between elements
In our example, the custom text analysis configuration is managed within the SAP HANA repository. Figure 3 below highlights the property sections that enable custom dictionaries and include the custom dictionary path.
Figure 3: Custom Configuration File
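As a rough sketch of what such a property section can look like, the fragment below adds a dictionary reference to the extraction analyzer's configuration. The analyzer name prefix is elided in the same way as in the analyzer listing above, the property name "Dictionaries" is my understanding of the standard configuration syntax, and the file name Cust_Dict is a hypothetical placeholder, so treat this as an assumption to check against your own configuration file rather than a copy of Figure 3.

```xml
<!-- Sketch only: analyzer name prefix elided, as in the analyzer listing above -->
<configuration name="…ExtractionAnalyzer.TF">
  <!-- Custom dictionaries to apply during entity extraction -->
  <property name="Dictionaries" type="string-list">
    <!-- Hypothetical file name; assumed form is package::file.hdbtextdict -->
    <string-list-value>sap.hana.ta.dict::Cust_Dict.hdbtextdict</string-list-value>
  </property>
</configuration>
```

The package part of the path, sap.hana.ta.dict, is the repository package used for the dictionary in this blog; a dictionary stored elsewhere would be referenced with its own package name.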
The dictionary is created in the “sap.hana.ta.dict” repository package and the Text Analysis configuration is created in the “sap.hana.ta.config” repository package, as seen in Figure 4 below.
Figure 4: Repository Path
*************************************************************************************
This custom configuration is used to extract basic entities from the text as well as entities of interest, including people, places, firms, URLs, and other common terms.
-- Sample table holding the text to analyze
CREATE COLUMN TABLE "EXT_CORE" (
  ID INTEGER PRIMARY KEY,
  STRING NVARCHAR(200)
);
INSERT INTO "EXT_CORE" VALUES (1, 'Ruby likes working at SAP');
INSERT INTO "EXT_CORE" VALUES (2, 'Rohan dislikes soccer');
INSERT INTO "EXT_CORE" VALUES (3, 'Rohan really likes football');
INSERT INTO "EXT_CORE" VALUES (4, 'Australia won 74 Gold in Commonwealth Games India');
INSERT INTO "EXT_CORE" VALUES (5, 'India won 38 Gold in Games 2010');
-- Full-text index that runs text analysis with the custom configuration
CREATE FULLTEXT INDEX EXT_CORE_INDEX ON "EXT_CORE" ("STRING")
  CONFIGURATION 'sap.hana.ta.config::Cust_Extraction_Core'
  TEXT ANALYSIS ON;
-- The analysis results are written to the generated $TA_ table
SELECT * FROM "$TA_EXT_CORE_INDEX";
Figure 5 below shows the result: the TA_RULE column reports the rule as Entity Extraction, and the output lists the new category names alongside the available basic entities.
Figure 5: Custom Configuration – Entity Extraction
In summary, we covered detailed steps on how to create and implement custom dictionaries in SAP HANA for performing Text Analysis in certain custom use cases.