Training Top2vec with embedding_batch_size=256 crashed OS X 12.3.1
tensorflow_macos 2.8.0, tensorflow_metal 0.4.0 Anaconda Python 3.8.5
% pip show tensorflow_macos
WARNING: Ignoring invalid distribution -umpy (/Users/davidlaxer/tensorflow-metal/lib/python3.8/site-packages)
Name: tensorflow-macos
Version: 2.8.0
Summary: TensorFlow is an open source machine learning framework for everyone.
Home-page: https://www.tensorflow.org/
Author: Google Inc.
Author-email: packages@tensorflow.org
License: Apache 2.0
Location: /Users/davidlaxer/tensorflow-metal/lib/python3.8/site-packages
Requires: absl-py, astunparse, flatbuffers, gast, google-pasta, grpcio, h5py, keras, keras-preprocessing, libclang, numpy, opt-einsum, protobuf, setuptools, six, tensorboard, termcolor, tf-estimator-nightly, typing-extensions, wrapt
Required-by:
(tensorflow-metal) (base) davidlaxer@x86_64-apple-darwin13 top2vec % pip show tensorflow_metal
WARNING: Ignoring invalid distribution -umpy (/Users/davidlaxer/tensorflow-metal/lib/python3.8/site-packages)
Name: tensorflow-metal
Version: 0.4.0
Summary: TensorFlow acceleration for Mac GPUs.
Home-page: https://developer.apple.com/metal/tensorflow-plugin/
Author:
Author-email:
License: MIT License. Copyright © 2020-2021 Apple Inc. All rights reserved.
Location: /Users/davidlaxer/tensorflow-metal/lib/python3.8/site-packages
Requires: six, wheel
Required-by:
To train the model with embedding_model="universal-sentence-encoder", you'll have to download universal-sentence-encoder_4.
top2vec_trained = Top2Vec(documents=papers_filtered_df.text.tolist(), split_documents=True, **embedding_batch_size=256,** embedding_model="universal-sentence-encoder", use_embedding_model_tokenizer=True, embedding_model_path="/Users/davidlaxer/Downloads/universal-sentence-encoder_4", workers=8)
Here's the project:
https://github.com/ddangelov/Top2Vec
Here's the Jupyter notebook:
https://github.com/ddangelov/Top2Vec/blob/master/notebooks/CORD-19_top2vec.ipynb
You'll have to load the COVID-19 data set from Kaggle here:
https://www.kaggle.com/datasets/allen-institute-for-ai/CORD-19-research-challenge
I set filter size to 1,000:
def filter_short(papers_df):
papers_df["token_counts"] = papers_df["text"].str.split().map(len)
papers_df = **papers_df[papers_df.token_counts>1000].reset_index(drop=True)**
papers_df.drop('token_counts', axis=1, inplace=True)
return papers_df
Trace
panic(cpu 8 caller 0xffffff8020d449ad): userspace watchdog timeout: no successful checkins from WindowServer in 120 seconds
service: logd, total successful checkins since wake (127621 seconds ago): 12763, last successful checkin: 0 seconds ago
service: WindowServer, total successful checkins since wake (127621 seconds ago): 12751, last successful checkin: 120 seconds ago
service: remoted, total successful checkins since wake (127621 seconds ago): 12763, last successful checkin: 0
[Trace](https://developer.apple.com/forums/content/attachment/d17c2c9b-569b-4c1a-9c61-892ced7f785b)