Phrasebank from PDF#
This notebook extracts the most common phrases from the training data, e.g. an academic phrasebank built from a popular scientific writing guidebook or a high-level scientific journal.
Workflows#
Step 1: Load the data#
[1]:
from openphrasebank import extract_text_from_pdf, clean_text
pdf_path = r"../../data/Academic_Phrasebank.pdf"
# skip the cover pages and the last two pages
text = extract_text_from_pdf(pdf_path, skip_first=6, skip_last=2)
cleaned_text = clean_text(text)
[nltk_data] Downloading package punkt to /home/codespace/nltk_data...
[nltk_data] Package punkt is already up-to-date!
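The actual cleaning rules live in `openphrasebank.clean_text`; purely as an illustration of the kind of normalization a PDF-extraction step needs, a minimal stand-in (the function name and regexes below are assumptions, not the package's implementation) might look like:

```python
import re

def clean_text_sketch(text: str) -> str:
    # Join words hyphenated across line breaks (a common PDF extraction artifact)
    text = re.sub(r'-\n', '', text)
    # Collapse newlines and runs of whitespace into single spaces
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

print(clean_text_sketch("a ex-\nample  text\n"))
```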
Step 2: Extract the phrases#
[2]:
import spacy
from openphrasebank import extract_verb_phrases, extract_expanded_noun_phrases, is_valid_phrase
# Use the English pre-trained model from spaCy; see https://spacy.io/models for models in other languages
#! python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp(cleaned_text)
# match with the verb and noun phrases patterns
verb_phrases = extract_verb_phrases(doc)
expanded_noun_phrases = extract_expanded_noun_phrases(doc)
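The `extract_verb_phrases` and `extract_expanded_noun_phrases` helpers are defined in `openphrasebank`. As an illustration only, rule-based phrase matching of this kind can be sketched with spaCy's `Matcher`; the pattern below is hypothetical, not the one the package uses, and the POS tags are set by hand so no trained model is required:

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")
# Build a Doc with hand-set POS tags so no trained model download is needed
words = ["We", "examine", "the", "experimental", "results"]
pos = ["PRON", "VERB", "DET", "ADJ", "NOUN"]
doc = Doc(nlp.vocab, words=words, pos=pos)

matcher = Matcher(nlp.vocab)
# Hypothetical verb-phrase pattern: VERB + optional determiner + adjectives + NOUN
matcher.add("VP", [[{"POS": "VERB"}, {"POS": "DET", "OP": "?"},
                    {"POS": "ADJ", "OP": "*"}, {"POS": "NOUN"}]])
matches = [doc[start:end].text for _, start, end in matcher(doc)]
print(matches)
```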
Step 3: Filter and export#
[3]:
# Combine lists and remove duplicates
combined_phrases = set(expanded_noun_phrases + verb_phrases)
# sort
sorted_phrases = sorted(
    phrase for phrase in combined_phrases
    if 1 < len(phrase.split(' ')) < 5 and len(phrase) > 2 and is_valid_phrase(phrase)
)
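The filter keeps phrases of two to four words, longer than two characters, that pass `is_valid_phrase`. The real validity check is defined in `openphrasebank`; a rough, purely illustrative stand-in might reject anything other than alphabetic words separated by single spaces:

```python
import re

def is_valid_phrase_sketch(phrase: str) -> bool:
    # Hypothetical stand-in: accept only alphabetic words separated by single spaces
    return bool(re.fullmatch(r"[A-Za-z]+(?: [A-Za-z]+)*", phrase))

def keep(phrase: str) -> bool:
    # Same length conditions as the filter in the cell above
    words = phrase.split(' ')
    return 1 < len(words) < 5 and len(phrase) > 2 and is_valid_phrase_sketch(phrase)

candidates = {"the results", "fig 3 shows", "in contrast to", "a"}
print(sorted(p for p in candidates if keep(p)))
```

Here `"fig 3 shows"` is rejected by the digit, `"a"` by the word count, and the remaining two phrases survive.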
Step 4: Save the data#
[5]:
import re
# Write the sorted phrases to a Markdown file
with open('../../phrasebanks/academic_phrasebank.md', 'w') as file:
    for phrase in sorted_phrases:
        # Collapse any internal newlines or repeated spaces into single spaces
        cleaned_phrase = re.sub(r'\s+', ' ', phrase).strip()
        file.write(cleaned_phrase + '\n')