There are two main stages to the clustering process:
IDOL Server builds seeds when you send the ClusterSnapshot action. IDOL Server takes a sample of the documents that it stores, and tries to associate individual documents with each other, based on the similarity of the concepts that the documents contain. Each group produced at this stage, containing a sample document and similar documents, is a seed.
IDOL Server stops trying to build a seed when the seed meets the requirements that SeedSize specifies or when there are no more documents that meet the similarity requirement that SeedBindLevel specifies (whichever condition is reached first). IDOL Server discards any seeds that do not reach the required size.
The number of clusters that you specify with NumClusters affects the number of sample documents that IDOL Server tries to create seeds from. You can adjust the relationship between the number you specify here and the size of the sample used by changing the value of StartingSuggestOverrideFactor.
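The stopping rules for seed building can be illustrated with a toy model. The following Python sketch is not IDOL Server's algorithm. It assumes that documents are plain integers, that similarity is a caller-supplied function returning a score where higher means more similar, and that the seed size counts the sample document itself. The names build_seed, seed_size, and seed_bind_level are hypothetical stand-ins that only mirror the roles of SeedSize and SeedBindLevel.

```python
from typing import Callable, List, Optional, Sequence


def build_seed(
    sample: int,
    candidates: Sequence[int],
    similarity: Callable[[int, int], float],
    seed_size: int,
    seed_bind_level: float,
) -> Optional[List[int]]:
    """Toy model of seed building (illustrative only, not IDOL's method)."""
    seed = [sample]
    # Consider the candidates most similar to the sample document first.
    ranked = sorted(
        (c for c in candidates if c != sample),
        key=lambda c: similarity(sample, c),
        reverse=True,
    )
    for doc in ranked:
        if len(seed) >= seed_size:
            break  # the seed has reached the required size
        if similarity(sample, doc) < seed_bind_level:
            break  # no remaining document is similar enough
        seed.append(doc)
    # A seed that never reached the required size is discarded.
    return seed if len(seed) >= seed_size else None


def toy_similarity(a: int, b: int) -> float:
    """Hypothetical similarity score: closer numbers are more similar."""
    return 100 - abs(a - b)
```

With candidates [10, 11, 12, 50], a sample of 10, and a bind level of 90, a required size of 3 yields the seed [10, 11, 12]. A required size of 4 yields no seed, because document 50 is not similar enough and the seed is discarded for being too small.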
IDOL Server groups seeds into clusters when you send the ClusterSGDataGen or ClusterCluster action. IDOL Server tries to create clusters by grouping seeds together. The grouping is based on the similarity of the concepts that the seeds or clusters contain.
Clustering is complete when one of the following conditions is met:
IDOL Server creates the number of clusters specified by NumClusters.
IDOL Server cannot create any more clusters that meet the similarity requirement specified by BindLevel.
IDOL Server discards clusters that do not meet the quality requirement set by BindLevel or the size requirement set by MinClusterDocs.
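The completion and discard rules above can be sketched in the same toy style. This example uses a simple greedy "leader" grouping as a stand-in, which is an assumption for illustration and not the method that IDOL Server uses. Seeds are modelled as lists of integer document IDs, and group_seeds, mean_similarity, and the lowercase argument names are hypothetical, chosen only to echo NumClusters, BindLevel, and MinClusterDocs.

```python
from statistics import mean
from typing import Callable, List

Seed = List[int]     # a seed modelled as a list of document IDs
Cluster = List[int]  # a cluster is the union of the seeds grouped into it


def mean_similarity(a: Seed, b: Seed) -> float:
    """Hypothetical score: seeds with closer average IDs are more similar."""
    return 100 - abs(mean(a) - mean(b))


def group_seeds(
    seeds: List[Seed],
    similarity: Callable[[Seed, Seed], float],
    num_clusters: int,
    bind_level: float,
    min_cluster_docs: int,
) -> List[Cluster]:
    """Toy greedy grouping of seeds (illustrative only)."""
    clusters: List[Cluster] = []
    remaining = list(seeds)
    # Stop when enough clusters exist or no seeds are left to group.
    while remaining and len(clusters) < num_clusters:
        leader = remaining.pop(0)
        members: List[Seed] = []
        rest: List[Seed] = []
        for seed in remaining:
            # Only seeds that meet the similarity requirement join the leader.
            target = members if similarity(leader, seed) >= bind_level else rest
            target.append(seed)
        remaining = rest
        docs = {doc for seed in [leader] + members for doc in seed}
        clusters.append(sorted(docs))
    # Discard clusters that are smaller than the minimum size.
    return [c for c in clusters if len(c) >= min_cluster_docs]
```

For the seeds [[1, 2, 3], [2, 3, 4], [50, 51], [90]] with a bind level of 95 and a minimum size of 2, the first two seeds merge, the third stands alone, and the single-document cluster is discarded. Lowering the requested number of clusters to 1 ends the process after the first cluster.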
For details of the clustering actions, and the settings you can make to generate the clusters from your data, refer to the IDOL Server Reference.
The ideal values for the parameters that affect clustering depend on the nature and amount of data in your IDOL Server. You can use the SentientClustering parameter for the ClusterSnapshot action to automatically determine the correct values for SeedSize and SeedBindLevel.
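As a rough illustration, the following Python sketch builds a request string for the ClusterSnapshot action with SentientClustering turned on. It assumes that actions are sent as HTTP requests of the form http://host:port/action=Name&Parameter=Value and that SentientClustering accepts the value true. The helper name, host, and port are hypothetical placeholders, and the action might require further parameters, so check the IDOL Server Reference before sending it.

```python
from urllib.parse import urlencode


def cluster_snapshot_url(host: str, port: int, **params: str) -> str:
    """Build a ClusterSnapshot request string (assumed URL convention)."""
    # The action name comes first, followed by any extra parameters.
    query = urlencode({"action": "ClusterSnapshot", **params})
    return f"http://{host}:{port}/{query}"


# Hypothetical host and port; replace them with your own server details.
url = cluster_snapshot_url("localhost", 9000, SentientClustering="true")
print(url)
```

The sketch only constructs the string and does not contact a server, so it can be adapted safely before any request is sent.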
This section makes general recommendations about how to manually alter these parameters according to your data. Parameters are closely interdependent, so make these changes in combination with each other (rather than just changing one of the settings). Change values in small steps.
Although you can make many changes to clustering, the number and size of clusters that IDOL Server can identify ultimately depend on the data that it contains. You can:
Cluster a Small Amount of Data
If your IDOL Server has a small amount of data, it is likely to identify fewer clusters, because it is less likely that your data contains a lot of similar documents for several different topics. You can edit the following parameters to change clustering in this situation.
Note: Ideally, your IDOL Server should contain at least 500 documents.
Cluster a Large Amount of Data
If your IDOL Server has a large amount of data, you probably do not need to edit any clustering parameters, because this is the situation in which clustering is most successful. In some cases (for example, if your IDOL Server contains more than a million documents), it can be beneficial to alter the following parameter.
If the documents in your IDOL Server contain highly similar concepts, IDOL Server might identify a small number of large clusters. For example, if your IDOL Server contains mostly documents about sports, then you might get one large sports cluster. This situation is a realistic characterization of the data in your IDOL Server, but in many circumstances is not useful. You can edit the following parameters to generate smaller, more specific clusters (for example, breaking sports into football, tennis, golf, and so on).
If the documents in your IDOL Server contain a wide variety of concepts, there might not be enough similar documents for IDOL Server to create seeds or clusters that characterize the data that it stores. You can lower the similarity requirement with the following parameters.
It might be the case that although IDOL Server identifies clusters that characterize your data successfully, you want to change the view of the data that clustering creates. The following parameters enable you to change the data view that clusters generate.