Phrasebank from PDF#

This notebook extracts the most common phrases from the training data.

E.g., an academic phrasebank built from a popular scientific writing guidebook, or from a high-level scientific journal.

Workflow#

Step 1: Load the data#

[1]:
from openphrasebank import extract_text_from_pdf, clean_text

pdf_path = r"../../data/Academic_Phrasebank.pdf"

# Skip the first six pages (including the cover) and the last two pages
text = extract_text_from_pdf(pdf_path, skip_first=6, skip_last=2)
cleaned_text = clean_text(text)

[nltk_data] Downloading package punkt to /home/codespace/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
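
As a quick sanity check, you can preview the cleaned text before parsing it (the exact output depends on the PDF):

print(len(cleaned_text))
print(cleaned_text[:300])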

Step 2: Extract the phrases#

[2]:
import spacy
from openphrasebank import extract_verb_phrases, extract_expanded_noun_phrases, is_valid_phrase
# Use spaCy's pre-trained English model; see https://spacy.io/models for models in other languages.
# If the model is not installed yet, uncomment and run:
#!python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")
doc = nlp(cleaned_text)
# Match the verb-phrase and noun-phrase patterns
verb_phrases = extract_verb_phrases(doc)
expanded_noun_phrases = extract_expanded_noun_phrases(doc)
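
For orientation, the sketch below shows one way such patterns can be written with spaCy's rule-based Matcher. It is an illustrative assumption, not the actual rules inside extract_verb_phrases or extract_expanded_noun_phrases:

from spacy.matcher import Matcher

# Hypothetical verb-phrase pattern (illustrative only): optional auxiliaries,
# a verb, and an optional trailing preposition
matcher = Matcher(nlp.vocab)
matcher.add("VP", [[{"POS": "AUX", "OP": "*"},
                    {"POS": "VERB"},
                    {"POS": "ADP", "OP": "?"}]])
for _, start, end in matcher(doc)[:5]:
    print(doc[start:end].text)

# Noun phrases can similarly be seeded from spaCy's built-in noun chunks
for chunk in list(doc.noun_chunks)[:5]:
    print(chunk.text)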

Step 3: Filter and export#

[3]:

# Combine lists and remove duplicates
combined_phrases = set(expanded_noun_phrases + verb_phrases)

# Keep phrases of two to four words and more than two characters that pass
# validation, then sort them alphabetically
sorted_phrases = sorted(
    phrase for phrase in combined_phrases
    if 1 < len(phrase.split(' ')) < 5
    and len(phrase) > 2
    and is_valid_phrase(phrase)
)
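
A quick look at the filtered result helps confirm the thresholds are reasonable (counts and examples vary with the input PDF):

print(len(sorted_phrases))
print(sorted_phrases[:5])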

[5]:
# Save the data
import re

# Write the sorted phrases to a Markdown file, one phrase per line
with open('../../phrasebanks/academic_phrasebank.md', 'w') as file:
    for phrase in sorted_phrases:
        # Remove any stray newlines left over from the PDF extraction
        cleaned_phrase = re.sub(r'\n+', '', phrase)
        file.write(cleaned_phrase + '\n')
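
To verify the export, you can read the file back:

with open('../../phrasebanks/academic_phrasebank.md') as file:
    print(file.read().splitlines()[:5])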