The sentiment analysis grammar files contain dictionaries of types of word (for example, positive adjective, negative noun, neutral adverb, and so on), and patterns that describe how to combine these dictionaries to form positive and negative phrases.
For example, you could run sentiment extraction using the English sentiment grammar file (sentiment_eng.ecr), with the following hotel review as the input file:
The room was nice enough, with a plug in radiator, tv with an English news channel, hot shower, comfy bed. The receptionist we first dealt with was miserable and rude, and just grunted at us and rolled her eyes because we were too early for check in having just got off the morning train from Khabarovsk. Fortunately, a younger receptionist with a nice smile appeared, spoke to us helpfully suggesting a few cafes nearby to pass some time, and we tried to forget about the other woman.
Breakfast is terrible. Unidentifiable cordials, gloomy porridge, bread rolls filled with things you don't expect for breakfast, like potato, egg and dill. Don't come here for the breakfast, but for the cost of the room in a city like Vladivostok, the hotel is still decent value for money.
The following is a sample of the output that this produces:
<?xml version="1.0" encoding="UTF-8"?>
<MATCHLIST>
  <DOCUMENT Type="IDOL IDX" ID="Unknown">
    <FIELD Name="DRECONTENT">
      <FIELD_INSTANCE Value="1">
        <MATCH EntityName="sentiment/positive/eng" Offset="7" OffsetLength="5" Score="1.05" NormalizedTextSize="17" NormalizedTextLength="17" OriginalTextSize="17" OriginalTextLength="17">
          <ORIGINAL_TEXT>The room was nice</ORIGINAL_TEXT>
          <NORMALIZED_TEXT>The room was nice</NORMALIZED_TEXT>
          <COMPONENTS>
            <COMPONENT Name="TOPIC" Text="The room" Offset="0" OffsetLength="0" TextSize="8" TextLength="8"/>
            <COMPONENT Name="SENTIMENT" Text="nice" Offset="13" OffsetLength="13" TextSize="4" TextLength="4"/>
          </COMPONENTS>
        </MATCH>
        <MATCH EntityName="sentiment/negative/eng" Offset="494" OffsetLength="492" Score="1.2" NormalizedTextSize="21" NormalizedTextLength="21" OriginalTextSize="21" OriginalTextLength="21">
          <ORIGINAL_TEXT>Breakfast is terrible</ORIGINAL_TEXT>
          <NORMALIZED_TEXT>Breakfast is terrible</NORMALIZED_TEXT>
          <COMPONENTS>
            <COMPONENT Name="TOPIC" Text="Breakfast" Offset="0" OffsetLength="0" TextSize="9" TextLength="9"/>
            <COMPONENT Name="SENTIMENT" Text="terrible" Offset="13" OffsetLength="13" TextSize="8" TextLength="8"/>
          </COMPONENTS>
        </MATCH>
      </FIELD_INSTANCE>
    </FIELD>
  </DOCUMENT>
</MATCHLIST>
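Output of this shape can be post-processed with any XML parser. The following is a minimal sketch in Python, assuming the output has been saved or captured as a string; the function name extract_sentiment_matches and the trimmed SAMPLE_OUTPUT string are illustrative and not part of the product. It reads only the element and attribute names that appear in the sample above.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample output (one match only), used for illustration.
SAMPLE_OUTPUT = """<?xml version="1.0" encoding="UTF-8"?>
<MATCHLIST>
  <DOCUMENT Type="IDOL IDX" ID="Unknown">
    <FIELD Name="DRECONTENT">
      <FIELD_INSTANCE Value="1">
        <MATCH EntityName="sentiment/positive/eng" Offset="7" OffsetLength="5" Score="1.05">
          <ORIGINAL_TEXT>The room was nice</ORIGINAL_TEXT>
          <NORMALIZED_TEXT>The room was nice</NORMALIZED_TEXT>
          <COMPONENTS>
            <COMPONENT Name="TOPIC" Text="The room"/>
            <COMPONENT Name="SENTIMENT" Text="nice"/>
          </COMPONENTS>
        </MATCH>
      </FIELD_INSTANCE>
    </FIELD>
  </DOCUMENT>
</MATCHLIST>"""


def extract_sentiment_matches(xml_text):
    """Return one dict per MATCH element (hypothetical helper)."""
    # Parse from bytes so that the encoding declaration is honored.
    root = ET.fromstring(xml_text.encode("utf-8"))
    results = []
    for match in root.iter("MATCH"):
        # Map each component name (TOPIC, SENTIMENT) to its matched text.
        components = {
            comp.get("Name"): comp.get("Text")
            for comp in match.iter("COMPONENT")
        }
        results.append({
            "entity": match.get("EntityName"),
            "score": float(match.get("Score")),
            "text": match.findtext("ORIGINAL_TEXT"),
            "topic": components.get("TOPIC"),
            "sentiment": components.get("SENTIMENT"),
        })
    return results


matches = extract_sentiment_matches(SAMPLE_OUTPUT)
for m in matches:
    print(f'{m["entity"]}: topic={m["topic"]!r}, sentiment={m["sentiment"]!r}, score={m["score"]}')
```

The TOPIC and SENTIMENT entries are present only when components are enabled in the configuration; otherwise the sketch returns None for them.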
The following example configuration shows the recommended usage:
[Eduction]
ResourceFiles=grammars/sentiment_eng.ecr
// Note: replace sentiment_eng.ecr by sentiment_user_eng.ecr if using user modification
// standard entities for all sentiment analysis in English:
Entity0=sentiment/positive/eng
Entity1=sentiment/negative/eng
EntityField0=POSITIVE_VIBE
EntityField1=NEGATIVE_VIBE
EntityComponentField0=TOPIC,SENTIMENT
EntityComponentField1=TOPIC,SENTIMENT
// some invalid matches are given very low scores so that we can filter them out:
MinScore=0.1
// for extraction of Twitter handles, hashtags and emoticons:
TangibleCharacters=@#:;
// for displaying metadata:
OutputScores=True
OutputSimpleMatchInfo=False
EnableComponents=True
For more information on the sentiment analysis grammar files, how to adjust the sentiment analysis by extending the grammars, and the features that the sentiment grammars support, refer to IDOL Expert.
The standard sentiment analysis grammars are designed for high precision. As a consequence, for some sources of short comment data, such as YouTube comments, they find no positive or negative matches in some documents even though those documents clearly express sentiment.
If recall with the full sentiment_eng.ecr grammar file is too low, and your documents are generally short comments, use sentiment_basic_eng.ecr to extract additional matches. This grammar contains carefully selected lists of positive and negative terms that help determine the sentiment of a document in which sentiment_eng.ecr found no matches.
sentiment_basic_eng.ecr contains terms in title case, but research shows that for most data these terms reduce precision, so they are given a lower score. Micro Focus recommends that you set EntityMinScoreN to 0.4 to filter out these terms unless you need them.
sentiment_basic_eng.ecr does not expose TOPIC or SENTIMENT components, and does not use scores to reflect strength or reliability of polarity.
The following additional example configuration shows the recommended usage:
[Eduction]
ResourceFiles=grammars/sentiment_eng.ecr,grammars/sentiment_basic_eng.ecr
// optional further layer of analysis for very short documents:
Entity2=sentiment/basic_positive/eng
Entity3=sentiment/basic_negative/eng
EntityField2=BASIC_POSITIVE_VIBE
EntityField3=BASIC_NEGATIVE_VIBE
// remove this setting to include basic matches in titlecase - this is not recommended because on most data it decreases precision:
EntityMinScore2=0.4
EntityMinScore3=0.4
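One way to use the basic matches downstream is as a fallback: consult them only when the full grammar found nothing in a document. The following Python sketch illustrates that idea under stated assumptions. It assumes each document's extracted fields are available as a dictionary that maps the field names from the configurations above to lists of matched strings. The function document_polarity and the majority-count rule are hypothetical simplifications, not product behavior.

```python
def document_polarity(fields):
    """Classify a document, preferring full-grammar matches (hypothetical helper).

    `fields` maps a field name (for example "POSITIVE_VIBE") to a list of
    matched strings; missing fields are treated as having no matches.
    """
    def count(name):
        return len(fields.get(name, []))

    # First consult the matches from the full sentiment grammar.
    positive = count("POSITIVE_VIBE")
    negative = count("NEGATIVE_VIBE")

    # Fall back to the basic grammar only if the full grammar found nothing.
    if positive == 0 and negative == 0:
        positive = count("BASIC_POSITIVE_VIBE")
        negative = count("BASIC_NEGATIVE_VIBE")

    if positive == 0 and negative == 0:
        return "none"
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return "mixed"


# A short comment where only the basic grammar produced a match.
short_comment = {"BASIC_POSITIVE_VIBE": ["awesome"]}
# A review where the full grammar matched; basic matches are then ignored.
review = {
    "POSITIVE_VIBE": ["The room was nice"],
    "NEGATIVE_VIBE": ["Breakfast is terrible", "miserable and rude"],
    "BASIC_POSITIVE_VIBE": ["nice"],
}

print(document_polarity(short_comment))
print(document_polarity(review))
```

Counting matches ignores scores, which suits the basic grammar because it does not use scores to reflect strength of polarity. For full-grammar matches, a score-weighted rule might be a better fit.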