Quickstart
You can download a pre-made phrasebank from the table. If you need a custom one, install the package and follow the steps below.
pip install openphrasebank
Build a Custom Phrasebank in 3 Steps
The example below is based on n-gram frequency. More examples, such as extracting phrases from PDFs, are available in the documentation.
1️⃣ Load and Tokenize the Data
import openphrasebank as opb

tokens_gen = opb.load_and_tokenize_data(
    dataset_name="orieg/elsevier-oa-cc-by",
    subject_areas=['PSYC', 'SOCI'],
    keys=['title', 'abstract', 'body_text'],
    save_cache=True,
    cache_file='temp_tokens.json',
)
2️⃣ Generate N-grams
n_values = [1, 2, 3, 4, 5, 6, 7, 8]
# Keep the result: step 3 reads it as ngram_freqs[n]
ngram_freqs = opb.generate_multiple_ngrams(tokens_gen, n_values)
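To make the idea of n-gram frequency concrete, here is a minimal standard-library sketch of counting n-grams of several lengths at once. It is not the implementation of `opb.generate_multiple_ngrams`; the function name `count_ngrams`, the toy documents, and the choice of joining tokens with spaces are all assumptions made for illustration.

```python
from collections import Counter

def count_ngrams(token_lists, n_values):
    """Count n-grams of each length in n_values over an iterable of token lists.

    Hypothetical helper for illustration only; not part of openphrasebank.
    """
    counts = {n: Counter() for n in n_values}
    for tokens in token_lists:
        for n in n_values:
            # Zipping n staggered views of the token list yields every n-gram
            grams = zip(*(tokens[i:] for i in range(n)))
            counts[n].update(" ".join(gram) for gram in grams)
    return counts

# Toy input: two already-tokenized "documents"
docs = [
    ["the", "results", "of", "this", "study"],
    ["the", "results", "of", "the", "survey"],
]
toy_freqs = count_ngrams(docs, [1, 2, 3])
print(toy_freqs[3].most_common(1))  # [('the results of', 2)]
```

N-grams are counted within each document separately, so no phrase spans a document boundary.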
3️⃣ Filter and Save
# Define the top limits for each n-gram length
top_limits = {1: 2000, 2: 2000, 3: 1000, 4: 300, 5: 200, 6: 200, 7: 200, 8: 200}
# Filter the frequent n-grams and store the results in a dictionary
phrases = {}
freqs = {}
for n, limit in top_limits.items():
    phrases[n], freqs[n] = opb.filter_frequent_ngrams(ngram_freqs[n], limit, min_freq=20)
# Combine and sort the phrases from n-gram lengths 2 to 6
sorted_phrases = sorted(sum((phrases[n] for n in range(2, 7)), []))
# Write the sorted phrases to a text file, one phrase per line
with open('../elsevier_phrasebank_PSYC_SOCI.txt', 'w') as file:
    for line in sorted_phrases:
        file.write(line + '\n')
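The filtering step combines two cut-offs: a top limit per n-gram length and a minimum frequency. The sketch below shows one plausible way to apply both with `collections.Counter`. It is an assumption about the general technique, not the actual behavior of `opb.filter_frequent_ngrams`, which may apply further rules; `top_ngrams` and the toy counts are hypothetical.

```python
from collections import Counter

def top_ngrams(counter, limit, min_freq=1):
    """Return up to `limit` of the most frequent items whose count is >= min_freq.

    Hypothetical helper for illustration only; not part of openphrasebank.
    """
    # most_common() sorts by descending count, so slicing keeps the top entries
    kept = [(gram, count) for gram, count in counter.most_common() if count >= min_freq]
    kept = kept[:limit]
    grams = [gram for gram, _ in kept]
    counts = [count for _, count in kept]
    return grams, counts

# Toy counts, invented for this example
toy_counts = Counter({
    "in this study": 42,
    "the results of": 35,
    "on the other hand": 19,
    "a b c": 3,
})
toy_phrases, toy_phrase_freqs = top_ngrams(toy_counts, limit=2, min_freq=20)
print(toy_phrases)       # ['in this study', 'the results of']
print(toy_phrase_freqs)  # [42, 35]
```

Applying the frequency floor before the limit means a length with few frequent n-grams simply returns fewer phrases than its limit allows.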