Published Feb 3, 2021 · 4 min read
We will discuss spaCy, YAKE, rake-nltk, and Gensim for the keyword extraction process.
When you wake up in the morning, the first thing you do is open your phone and check messages. Your mind has been trained to ignore the WhatsApp messages of the people and groups you don’t like; you decide the importance of a message just by checking the keywords in the contact or group name.
Your mind extracts keywords from a WhatsApp group name or contact name and learns to like it or ignore it. It also depends on many other factors. The same behavior can be seen while reading articles, watching TV or Netflix series, etc.
Machine learning can mimic this behavior. In Natural Language Processing (NLP), it is known as keyword extraction. Whether you read an article or news item can depend on its extracted keywords, such as data science, machine learning, or artificial intelligence.
The keyword extraction process not only categorizes articles but also helps save time on social media platforms: you can decide whether to read a post and its comments based on their keywords.
You can also check whether your article fits a current trend, or whether it is likely to trend, by searching the extracted keywords on Google Trends. It is one factor among many, not the only one.
Every article, post, and comment has its own important words that make it useful or useless. The keyword extraction process identifies those words and categorizes the text data.
In this article, we will go through the Python libraries that help with the keyword extraction process.
Those libraries are:
- spaCy
- YAKE
- Rake-Nltk
- Gensim
Let’s start.
spaCy is an all-in-one Python library for NLP tasks, but here we are interested in its keyword extraction functionality.
We will start by installing the spaCy library and downloading a model such as en_core_sci_lg. After that, we pass the article text into the NLP pipeline, which returns the extracted keywords.
Each model has its own strengths. If an article consists of medical terms, use the en_core_sci_lg model; otherwise, you can use the en_core_web_sm model.
Find the related code below.
Observations.
- The output of the doc.ents object can contain 1-grams, 2-grams, 3-grams, etc. You can’t control the extraction process with an n-gram size or other parameters.
- For text with medical terms, use an en_core_sci_xx (xx = lg, sm, md) model. These models also perform well on non-medical articles.
- Load a different model using the spacy.load() function. Visit sites one and two to learn more about these models.
Use the YAKE Python library to control the keyword extraction process.
The Yet Another Keyword Extractor (YAKE) library selects the most important keywords from an article using statistical text features. With YAKE, you can control the word count of the extracted keywords and other settings.
Find the related code below.
Observations.
- If you want to extract keywords from text in a non-English language such as German, use language=’de’. A mismatch between the text language and the language variable will give you poorly extracted keywords.
- The max_ngram_size parameter limits the word count of each extracted keyword. If you keep max_ngram_size=3, then no keyword will be longer than 3 words, but keywords shorter than 3 words will also appear.
- The deduplication_threshold parameter limits the repetition of words across different keywords. Set deduplication_threshold to 0.1 to avoid repeated words in the keywords; set it to 0.9 to allow repetition.
Example:
For deduplication_threshold = 0.1, the output will be [‘python and cython’, ‘software’, ‘ines’, ‘library is published’].
For deduplication_threshold = 0.9, the output will be [‘python and cython’, ‘programming languages python’, ‘natural language processing’, ‘advanced natural language’, ‘languages python’, ‘language processing’, ‘ines montani’, ‘cython’, ‘advanced natural’, ‘honnibal and ines’, ‘software company explosion’, ‘natural language’, ‘programming languages’, ‘matthew honnibal’, ‘python’, ‘open-source software library’, ‘company explosion’, ‘spacy’, ‘processing’, ‘written’].
- The numOfKeywords parameter determines the number of keywords extracted. If numOfKeywords = 20, then the total number of keywords extracted will be less than or equal to 20.
Here are other keyword extraction methods that you can test on your data.
You can form a powerful keyword extraction method by combining the Rapid Automatic Keyword Extraction (RAKE) algorithm with the NLTK toolkit; the result is known as rake-nltk, a modified version of the original algorithm. You can learn more about rake-nltk here. Install it with pip install rake-nltk.
Find the related code below.
Observations.
- The keyword_extracted variable holds the ranked keyword data. To restrict the keyword count, you can use the code below.
keyword_extracted = rake_nltk_var.get_ranked_phrases()[:5]
- Rake-nltk’s performance is comparable to spaCy’s.
Gensim was primarily developed for topic modeling. Over time, Gensim added other NLP tasks such as summarization, finding text similarity, etc. Here we will demonstrate using Gensim for the keyword extraction task.
Install Gensim using the pip install gensim command.
Find the relevant code below.
Observations:
- The performance of Gensim in extracting keywords is still not at the level of spaCy and rake-nltk. There is room for improvement for Gensim in the keyword extraction task.
The keyword extraction process helps us identify the important words in a text. It is also effective in topic modeling tasks. You can learn a lot about your text data from only a few keywords, and those keywords will help you decide whether an article is worth reading.
In this article, I have explained 4 Python libraries (spaCy, YAKE, rake-nltk, Gensim) that fetch keywords from an article or text data. You can also search for other Python libraries for similar tasks.
Hopefully, this article will help you with your NLP tasks.