Phrasebank from Elsevier corpus [1]#
This notebook extracts the most common phrases from the training data.
E.g. phrasebank_pdf: generate an academic phrasebank from popular scientific writing guidebooks or a high-level scientific journal.
E.g. phrasebank_elsevier: generate an academic phrasebank from the Elsevier OA CC-BY corpus.
Workflows#
Step 1: Load the data#
[1]:
from openphrasebank import load_and_tokenize_data

# (1) The first run may take a while to download and tokenize the data (up to half an hour!)
# (2) Using the 'ENVI' and 'EART' subject areas; if not set, all subject areas are used.
tokens_gen = load_and_tokenize_data(dataset_name="orieg/elsevier-oa-cc-by",
                                    subject_areas=['EART', 'ENVI'],
                                    keys=['title', 'abstract', 'body_text'],
                                    save_cache=True,
                                    cache_file='temp_tokens.json')
[nltk_data] Downloading package punkt to /home/codespace/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Now Processing title: 100%|██████████| 6114/6114 [00:00<00:00, 7357.99it/s]
Now Processing abstract: 100%|██████████| 6114/6114 [00:09<00:00, 636.35it/s]
Now Processing body_text: 100%|██████████| 1357851/1357851 [04:28<00:00, 5058.60it/s]
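The download log above shows that tokenization relies on NLTK's punkt model. As a simplified stand-in for what this step produces (the real pipeline uses NLTK's tokenizer; the helper below is only illustrative), a tokenizer can be sketched as lowercasing and extracting word runs:

```python
import re

def simple_tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z]+", text.lower())

tokens = simple_tokenize("Climate change affects coastal erosion rates.")
# e.g. ['climate', 'change', 'affects', 'coastal', 'erosion', 'rates']
```

The cached `temp_tokens.json` file lets later cells re-read the token stream without repeating this expensive step.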
Step 2: Generate n-grams#
[1]:
from openphrasebank import tokens_generator, generate_multiple_ngrams, filter_frequent_ngrams
# Define the n values for which you want to calculate n-grams
n_values = [1,2,3,4,5,6,7,8]
tokens_gen = tokens_generator('temp_tokens.json')
# Generate the n-grams and count their frequencies
ngram_freqs = generate_multiple_ngrams(tokens_gen, n_values)
[nltk_data] Downloading package punkt to /home/codespace/nltk_data...
[nltk_data] Package punkt is already up-to-date!
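Conceptually, generating n-grams means sliding an n-token window over the token stream and counting each window. A minimal sketch of that idea, assuming the standard sliding-window definition (the helper name here is illustrative, not the openphrasebank API):

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count each contiguous n-token window in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = ["the", "results", "of", "the", "study"]
bigrams = count_ngrams(tokens, 2)
# windows: ('the', 'results'), ('results', 'of'), ('of', 'the'), ('the', 'study')
```

`generate_multiple_ngrams` does this for every `n` in `n_values` in one pass, returning a frequency table per n-gram length.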
Step 3: Filter and export#
[2]:
# Define the top limits for each n-gram length
top_limits = {1: 2000, 2: 2000, 3: 1000, 4: 300, 5: 200, 6: 200, 7: 200, 8: 200}
# Filter the frequent n-grams and store the results in a dictionary
phrases = {}
freqs = {}
for n, limit in top_limits.items():
    phrases[n], freqs[n] = filter_frequent_ngrams(ngram_freqs[n], limit, min_freq=20)
# Combine and sort the phrases from n-gram lengths 2 to 7
sorted_phrases = sorted(sum((phrases[n] for n in range(2, 8)), []))
# Write the sorted phrases to a Markdown file
# with open('../../phrasebanks/elsevier_phrasebank_ENVI_EART.txt', 'w') as file:
#     for line in sorted_phrases:
#         file.write(line + '\n')
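The filtering idea behind this step can be illustrated in isolation: keep only n-grams whose count meets a minimum frequency, then take the top-k by count. This toy sketch mirrors the concept; the actual `filter_frequent_ngrams` implementation may differ in details:

```python
from collections import Counter

def filter_top(counts, top_k, min_freq):
    """Return the top_k most common items whose count is at least min_freq."""
    frequent = [(ngram, c) for ngram, c in counts.most_common() if c >= min_freq]
    return frequent[:top_k]

counts = Counter({"in terms of": 50, "with respect to": 30, "rare phrase": 5})
top = filter_top(counts, top_k=2, min_freq=20)
# [('in terms of', 50), ('with respect to', 30)]
```

Lower `top_limits` for longer n-grams reflect that long phrases are rarer, so fewer of them clear the frequency bar.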
Step 4: Visualization#
[4]:
keywords = ["climate"]
# Convert all strings to lower case
lowercase_strings = [s.lower() for s in sorted_phrases]
[10]:
from openphrasebank import display_word_tree
# Example usage with actual data
js_code = display_word_tree(lowercase_strings, keywords[0])
[12]:
with open("../_static/wordtree_climate_geo.html", 'w') as file:
    file.write(js_code)