
Keyword Extraction Based On Entropy Difference

9 Sep 2013, CPOL, 7 min read
Here we provide simple and practical keyword extraction software and a DLL for long texts.

Please Note 

  If you use the keyword extraction software or the dynamic link library (DLL) in your program or research, please cite the following paper in the corresponding part of your paper or program. All the source code in this article may be copied freely.

  Zhen YANG, Jianjun LEI, Kefeng FAN, Yingxu LAI. Keyword Extraction by Entropy Difference between the Intrinsic and Extrinsic Mode, Physica A: Statistical Mechanics and its Applications, 392 (2013), 4523-4531. http://dx.doi.org/10.1016/j.physa.2013.05.052

Introduction      

Figure 1: Software interface

  This software uses an entropy difference measure to extract keywords from a text. The measure is simple, needs no a priori information, and extracts keywords effectively from a single text. We provide the DLL in two versions, one developed in C++ and one in C#, together with a keyword extraction program written in C#. With the program, all you need to do is set the path of the files to be processed, and the software does the rest. For English text, make sure the text is in the standard format, or use the "pretreatment" function in the software to format it. Then select one of the two methods, general entropy or maximum entropy, to extract the keywords.

  For Chinese text, format the text with the following steps. First, remove the punctuation and charts from the text. Then segment the sentences into a list of words, with consecutive words separated by a space. The software provides a function that removes the punctuation and charts, but you must make sure yourself that the sentences have been segmented into words. Finally, select one of the two methods, general entropy or maximum entropy, to extract the keywords.

  An example of standardized text is given in the Usage section. With these pieces in place, keyword extraction from a text is straightforward.

Background  

  One of the most significant differences between human-written texts and the output of monkeys typing is the presence of meaningful topics in human-written texts. Keyword (relevant word) extraction and ranking are the starting point for critical tasks such as topic detection and tracking in written texts, and they are widely applied in information extraction, selection and retrieval.

Features of the Algorithm     

  As a new method in the keyword extraction field, this method has the following highlights:

 It is a new metric to evaluate and rank the relevance of words in a text.

 The metric uses the Shannon entropy difference between the intrinsic and extrinsic modes.

 This work is a new result in keyword extraction and ranking.

 The method is especially suitable for single documents for which no a priori information is available.

Figure 2: Intrinsic mode and extrinsic mode in positions of word-type occurrences in text.

  Here is a brief introduction to the principle of the algorithm, which should help you understand and use the DLL and the software. The intrinsic-extrinsic mode idea rests on the observation that highly significant words tend to be modulated by the writer's intention, while common words are spread essentially uniformly throughout a text. The intrinsic mode denotes the statistical properties of the appearance of a relevant word within a topic, i.e., the statistical properties of clustering within each topic. The extrinsic mode captures the statistical properties of the disappearance of a word cluster along a written text, and it characterizes the relationship between the occurrence of word clustering within a topic and the author's writing style.

  As shown in Figure 2, the distance between two successive occurrences of a word is defined as d_i = t_{i+1} - t_i, where t_i is the position of the i-th occurrence of the word in the text. The arrival time difference d_i belongs to the intrinsic mode if d_i < μ, where μ is the mean waiting time of the word. In other words, a given occurrence of the word is part of an intrinsic mode if its local separation is less than its mean waiting time. Let d^I = {d_i | d_i < μ} be the set of all d_i < μ, as shown in the bottom-left panel of Figure 2.

  We found through experiments that the occurrences of a keyword tend to cluster, so its intrinsic mode entropy is large while its extrinsic mode entropy is small. General words are evenly distributed in the article and the spacing between consecutive occurrences changes little, so the entropy difference between the intrinsic and extrinsic modes is small. The value E, the entropy difference between the intrinsic and extrinsic modes, can therefore be used to extract keywords. In practice, to eliminate the effect of randomly distributed words and of the text boundaries, we use the C boundary conditions and the normalized entropy difference Enor as the final indicators.

  To learn more about the algorithm, see the full paper: http://dx.doi.org/10.1016/j.physa.2013.05.052
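The split into intrinsic and extrinsic distances can be sketched in a few lines of C++. This is only an illustration of the idea and not the code inside the DLL. It makes several assumptions: every distance that is not below μ is treated as extrinsic, the entropy is the plain Shannon entropy of the empirical frequencies of the distinct distance values, and the difference is taken as intrinsic minus extrinsic. The C boundary conditions and the normalization that yields Enor are left out, so the numbers will not match the DLL output; the paper gives the exact definitions. The names `shannon_entropy`, `split_modes`, `ModeSplit` and `entropy_difference` are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <map>
#include <vector>

// Shannon entropy (in bits) of the empirical distribution of the values in d.
// Assumption: probabilities are the relative frequencies of distinct values.
double shannon_entropy(const std::vector<int>& d) {
    if (d.empty()) return 0.0;
    std::map<int, int> counts;
    for (int v : d) ++counts[v];
    double h = 0.0;
    for (const auto& kv : counts) {
        double p = static_cast<double>(kv.second) / d.size();
        h -= p * std::log2(p);
    }
    return h;
}

struct ModeSplit {
    std::vector<int> intrinsic;  // distances d_i < mu
    std::vector<int> extrinsic;  // all remaining distances (assumption)
    double mu = 0.0;             // mean distance between occurrences
};

// positions holds t_1, t_2, ... : the positions of one word in the text.
ModeSplit split_modes(const std::vector<int>& positions) {
    ModeSplit s;
    if (positions.size() < 2) return s;
    std::vector<int> d;
    for (std::size_t i = 0; i + 1 < positions.size(); ++i)
        d.push_back(positions[i + 1] - positions[i]);  // d_i = t_{i+1} - t_i
    double sum = 0.0;
    for (int v : d) sum += v;
    s.mu = sum / d.size();
    for (int v : d) (v < s.mu ? s.intrinsic : s.extrinsic).push_back(v);
    return s;
}

// Unnormalized entropy difference; sign convention is an assumption.
double entropy_difference(const std::vector<int>& positions) {
    ModeSplit s = split_modes(positions);
    return shannon_entropy(s.intrinsic) - shannon_entropy(s.extrinsic);
}
```

For a word that occurs at perfectly regular intervals, all distances equal μ, the intrinsic set is empty, and the difference computed by this sketch is zero.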

Usage  

This section describes the keyword extraction software and the use of the DLL in detail. Two samples are used throughout the examples to illustrate the performance of the Enor metric: a scientific book in English and a news report in Chinese.

Please note: Before the text is evaluated, all punctuation symbols are removed, all words are converted to lowercase, and a simple whitespace-based tokenization is applied. Chinese text additionally needs word segmentation first.

 The keyword extraction software also provides a text pre-processing function. After pre-processing, the standardized text looks as follows:

 Standardized text input: can any body hear me oh am I talking to myself my mind is running empty In the search for someone else cause tonight I'm feeling like an astronaut……
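If you prefer to prepare English text in your own code rather than with the software, the pre-processing described above (remove punctuation, convert to lowercase, separate words by single spaces) can be approximated as follows. This is a minimal sketch and not the pre-processing function shipped with the software. The function name `normalize_text` is hypothetical. The sketch deletes punctuation outright, so "don't" becomes "dont", and it copies bytes outside the ASCII range unchanged on the assumption that they belong to UTF-8 encoded characters. It does not perform Chinese word segmentation.

```cpp
#include <cctype>
#include <string>

// Lowercase ASCII letters, drop ASCII punctuation, and collapse every run
// of whitespace into one space. Leading and trailing whitespace is removed.
std::string normalize_text(const std::string& raw) {
    std::string out;
    bool pending_space = false;  // a separator was seen since the last kept byte
    for (unsigned char c : raw) {
        if (c < 0x80 && std::isspace(c)) {
            pending_space = true;
            continue;
        }
        // Assumption: bytes >= 0x80 are part of multi-byte characters and are kept.
        bool keep = c >= 0x80 || std::isalnum(c);
        if (!keep) continue;  // punctuation and other symbols are deleted
        if (pending_space && !out.empty()) out += ' ';
        pending_space = false;
        out += static_cast<char>(c < 0x80 ? std::tolower(c) : c);
    }
    return out;
}
```

The returned string can be passed as the text argument of the DLL functions described below.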

 Using the software: First, click the icon to start the keyword extraction software. Then follow the flow shown in the following figure to run a keyword extraction.

  Using the dll

 Using the DLL provided here is fairly simple: if you are familiar with calling a DLL from C++ or C#, you will be able to use it easily.

 Using the C++ DLL

  Please note: Before using the C++ version of the DLL, set up Visual Studio (VS2010) as follows:

Open Visual Studio, click the menu Project -> Properties -> Configuration Properties -> C/C++ -> Code Generation -> Runtime Library, and select Multi-threaded DLL (/MD).

 Two versions of the C++ DLL are provided, a release version and a debug version; select the one that matches your build. For example, use the DLL in the "release" folder if you compile your code with the "Release" solution configuration. We recommend the release version because it is faster than the debug version.

 Step 1:

 In the unzipped folder, open the dll folder (inside the c++ folder) and then the release folder. It contains three files, as shown in the following picture. Copy these three files into your project directory and add the "Node.h" file to your project.

  Step 2:

  Include the header file "Node.h" in your code as follows: #include "Node.h"

  Please note: The order of the variables in the structure in "Node.h" must not be changed! The structure Node is declared as follows:

C++
#include <string>
#include <vector>
using namespace std;

// The structure returned by the DLL; it holds the key information for one word.
// Do not change the order of the members.
typedef struct
{
	string word;        // the word
	double EDnor;       // entropy difference; the greater the value, the more relevant the word
	int frequency;      // frequency of the word
	vector<int> t_loc;  // positions at which the word appears in the text
	vector<int> d_list; // distances between consecutive occurrences of the word
} Node;

#pragma comment (lib, "Keyword_Extraction.lib")
// input:  string text - the text after pretreatment
// output: int &num    - receives the size of the returned Node array
// return: Node*       - the keyword array
extern Node* keyword_extra_entropy(string text, int &num);     // general entropy method
extern Node* keyword_extra_entropy_MAX(string text, int &num); // maximum entropy method

  The DLL exports two functions: Node *keyword_extra_entropy(string text, int &num) and Node *keyword_extra_entropy_MAX(string text, int &num). The first uses the general entropy method and the second uses the maximum entropy method. Both take two arguments: text, a string holding the preprocessed text, and num, an int reference that receives the size of the returned array. Both return a Node*, an array of the Node structure described above.

  Step 3:

  After the steps above, you can call the function to get the keywords. For example, the following code shows the top 10 keywords:

C++
int i;
int num;
Node *result;
result = keyword_extra_entropy_MAX(text, num);
// print the word, its entropy difference and its frequency;
// stop early if fewer than 10 keywords were returned
for (i = 0; i < 10 && i < num; i++)
	cout << endl << result[i].word << "===" << result[i].EDnor << "===" << result[i].frequency;
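
If you need the top keywords as data rather than as printed output, a small helper can copy them out of the returned array. The sketch below is hypothetical and not part of the DLL. It assumes that the array is already ranked by EDnor, which the top 10 example above suggests but the article does not state explicitly. `KeywordNode` is a stand-in that mirrors the fields of Node so that the block compiles without the DLL; in a real project you would include "Node.h" and use Node instead. The `min_frequency` filter is an illustrative addition.

```cpp
#include <string>
#include <utility>
#include <vector>

// Stand-in with the same fields, in the same order, as Node in "Node.h".
struct KeywordNode {
    std::string word;
    double EDnor;
    int frequency;
    std::vector<int> t_loc;
    std::vector<int> d_list;
};

// Collect up to k (word, EDnor) pairs from the front of the array,
// skipping words that occur fewer than min_frequency times.
// Assumption: result[0..num-1] is already sorted by relevance.
std::vector<std::pair<std::string, double>> top_keywords(
        const KeywordNode* result, int num, int k, int min_frequency) {
    std::vector<std::pair<std::string, double>> top;
    if (result == nullptr || num <= 0 || k <= 0) return top;
    for (int i = 0; i < num && static_cast<int>(top.size()) < k; ++i) {
        if (result[i].frequency >= min_frequency)
            top.emplace_back(result[i].word, result[i].EDnor);
    }
    return top;
}
```

The helper never reads past num entries, so it is safe when the text yields fewer than k keywords.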

  Example:

  We now take the book "Origin of Species" as an example and demonstrate the whole process of using the DLL:

  Code:

C++
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>

#include "Node.h"

int main() {

	// Read the whole text file into a string.
	// Change the file path to match your own setup.
	ifstream filestr("D:\\test.txt", ios::binary);
	if (!filestr) {
		cerr << "Cannot open the input file" << endl;
		return 1;
	}
	filebuf *pbuf = filestr.rdbuf();
	long size = pbuf->pubseekoff(0, ios::end, ios::in);
	pbuf->pubseekpos(0, ios::in);
	char *buffer = new char[size + 1];
	pbuf->sgetn(buffer, size);
	buffer[size] = '\0';
	filestr.close();
	string text = buffer;
	delete[] buffer;

	// Call the function in the DLL to extract the keywords.
	int num;
	Node *result;
	result = keyword_extra_entropy_MAX(text, num);

	// Output all keywords in the array;
	// "num" is the size of the array.
	for (int i = 0; i < num; i++)
		cout << endl << result[i].word << "===" << result[i].EDnor << "===" << result[i].frequency;

	system("pause");
	return 0;
}

  Using the dll of C#  

  The C# version of the DLL packages an entire class, so it contains more functions than the C++ version (including the preprocessing functions). Please refer to the file "KEBOED interface documentation" to learn how to use the C# DLL.

  Experimental results

  For the English example, we ran our keyword extraction software on "Origin of Species" with the "maximum entropy" keyword extraction method.

  For the Chinese sample, we chose an online news report titled《让雷锋精神代代相传》("Let the Lei Feng Spirit Be Passed On from Generation to Generation") and ran the keyword extraction software on it with the "maximum entropy" keyword extraction method.

  Both sample texts are included in the compressed package.

Conclusion 

  In summary, understanding the complexity of human-written text requires an appropriate analysis of the statistical distribution of the words in the text. We find that highly significant words tend to be modulated by the writer's intention, while common words are spread essentially uniformly in a text. The ideas of this work can be applied to any natural language in which words are clearly identified, without requiring any prior knowledge of semantics or syntax.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

