Phrasebank from Elsevier corpus [1]#
This notebook extracts the most common phrases from the training data.
E.g. phrasebank_pdf: generate an academic phrasebank from popular scientific writing guidebooks or a high-level scientific journal.
E.g. phrasebank_elsevier: generate an academic phrasebank from the Elsevier OA CC-BY corpus.
Workflows#
Step 1: Load the data#
[1]:
from openphrasebank import load_and_tokenize_data

# (1) The first run may take a while to download and tokenize the data (up to half an hour!)
# (2) Using the 'ENVI' and 'EART' subject areas; if not set, all subject areas are used.
tokens_gen = load_and_tokenize_data(dataset_name="orieg/elsevier-oa-cc-by",
                                    subject_areas=['EART', 'ENVI'],
                                    keys=['title', 'abstract', 'body_text'],
                                    save_cache=True,
                                    cache_file='temp_tokens.json')
[nltk_data] Downloading package punkt to /home/codespace/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Now Processing title: 100%|██████████| 6114/6114 [00:00<00:00, 7357.99it/s]
Now Processing abstract: 100%|██████████| 6114/6114 [00:09<00:00, 636.35it/s]
Now Processing body_text: 100%|██████████| 1357851/1357851 [04:28<00:00, 5058.60it/s]
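The download log above shows that tokenization relies on NLTK's punkt model. As a simplified stand-in for what this step produces (the real pipeline uses NLTK's tokenizer; the helper below is only illustrative), a tokenizer can be sketched as lowercasing and extracting word runs:

```python
import re

def simple_tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z]+", text.lower())

tokens = simple_tokenize("Climate change affects coastal erosion rates.")
# e.g. ['climate', 'change', 'affects', 'coastal', 'erosion', 'rates']
```

The cached `temp_tokens.json` file lets later cells re-read the token stream without repeating this expensive step.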
Step 2: Generate n-grams#
[1]:
from openphrasebank import tokens_generator, generate_multiple_ngrams, filter_frequent_ngrams
# Define the n values for which you want to calculate n-grams
n_values = [1,2,3,4,5,6,7,8]
tokens_gen = tokens_generator('temp_tokens.json')
# Generate the n-grams and count their frequencies
ngram_freqs = generate_multiple_ngrams(tokens_gen, n_values)
[nltk_data] Downloading package punkt to /home/codespace/nltk_data...
[nltk_data] Package punkt is already up-to-date!
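Conceptually, generating n-grams means sliding an n-token window over the token stream and counting each window. A minimal sketch of that idea, assuming the standard sliding-window definition (the helper name here is illustrative, not the openphrasebank API):

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count each contiguous n-token window in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = ["the", "results", "of", "the", "study"]
bigrams = count_ngrams(tokens, 2)
# windows: ('the', 'results'), ('results', 'of'), ('of', 'the'), ('the', 'study')
```

`generate_multiple_ngrams` does this for every `n` in `n_values` in one pass, returning a frequency table per n-gram length.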
Step 3: Filter and export#
[2]:
# Define the top limits for each n-gram length
top_limits = {1: 2000, 2: 2000, 3: 1000, 4: 300, 5: 200, 6: 200, 7: 200, 8: 200}
# Filter the frequent n-grams and store the results in a dictionary
phrases = {}
freqs = {}
for n, limit in top_limits.items():
    phrases[n], freqs[n] = filter_frequent_ngrams(ngram_freqs[n], limit, min_freq=20)
# Combine and sort the phrases from n-gram lengths 2 to 7
sorted_phrases = sorted(sum((phrases[n] for n in range(2, 8)), []))
# Write the sorted phrases to a Markdown file
# with open('../../phrasebanks/elsevier_phrasebank_ENVI_EART.txt', 'w') as file:
#     for line in sorted_phrases:
#         file.write(line + '\n')
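The filtering idea behind this step can be illustrated in isolation: keep only n-grams whose count meets a minimum frequency, then take the top-k by count. This toy sketch mirrors the concept; the actual `filter_frequent_ngrams` implementation may differ in details:

```python
from collections import Counter

def filter_top(counts, top_k, min_freq):
    """Return the top_k most common items whose count is at least min_freq."""
    frequent = [(ngram, c) for ngram, c in counts.most_common() if c >= min_freq]
    return frequent[:top_k]

counts = Counter({"in terms of": 50, "with respect to": 30, "rare phrase": 5})
top = filter_top(counts, top_k=2, min_freq=20)
# [('in terms of', 50), ('with respect to', 30)]
```

Lower `top_limits` for longer n-grams reflect that long phrases are rarer, so fewer of them clear the frequency bar.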
Step 4: Visualization#
[4]:
keywords = ["climate"]
# Convert all strings to lower case
lowercase_strings = [s.lower() for s in sorted_phrases]
[10]:
from openphrasebank import display_word_tree
# Example usage with actual data
js_code = display_word_tree(lowercase_strings, keywords[0])
[12]:
with open("../_static/wordtree_climate_geo.html", 'w') as file:
    file.write(js_code)