Commit 0c9ebf45 authored by oschmanf

Merge branch 'dev-train-models' into 'main'

Dev train models

See merge request !2
parents 49e4d81c a11e862f
Showing 1090 additions and 143 deletions
# Moderation classifier
## Installation
### Local
```
python -m venv pp_env
source pp_env/bin/activate
pip install -r requirements.txt
```
### Euler
```
module load gcc/8.2.0 python_gpu/3.10.4 eth_proxy
python -m venv --system-site-packages pp_env_tf_python310
source pp_env_tf_python310/bin/activate
pip install -r requirements.txt
```
## Activation of environment
### Local
```
source pp_env/bin/activate
```
### Euler (TensorFlow)
```
srun --pty --mem-per-cpu=3g --gpus=1 --gres=gpumem:12g bash
module load gcc/8.2.0 python_gpu/3.10.4 eth_proxy
source pp_env_tf_python310/bin/activate
```
## Usage
### 1. Preprocessing of dataframe (adding language field)
```
moderation_classifier --prepare_data path_to_csv
```
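The added language field is what later allows filtering for German comments. As an illustration, below is a minimal sketch of such an annotation step; it assumes the `langdetect` package, and the repository's actual `DataProcessor.add_language` may be implemented differently.
```
# Hypothetical sketch of a language-annotation step (assumes langdetect);
# the repository's DataProcessor.add_language may differ.
import pandas as pd
from langdetect import detect

def add_language(path_csv: str) -> pd.DataFrame:
    df = pd.read_csv(path_csv)

    def safe_detect(text) -> str:
        try:
            return detect(str(text))
        except Exception:  # detect() raises on empty/non-linguistic input
            return "unknown"

    df["language"] = df["text"].apply(safe_detect)
    return df
```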
### 2. Model training
Several options can be chosen for model training:
```
Usage: moderation_classifier [OPTIONS] INPUT_DATA
Run moderation classifier.
:param split: Binary flag to specify if data should be split.
:param prepare_data: Binary flag to specify if data should be prepared.
:param text_preprocessing: Binary flag to set text preprocessing.
:param newspaper: Name of newspaper selected for training.
:param topic: Topic selected for training.
:param hsprob: List with min/max values for hate speech probability.
:param pretrained_model: Name of pretrained BERT model to use for finetuning.
:param train_mnb: Binary flag to specify whether MNB should be trained.
:param train_bert: Binary flag to specify whether BERT should be trained.
:param eval_mnb: Binary flag to specify whether MNB should be evaluated.
:param eval_bert: Binary flag to specify whether BERT should be evaluated.
:param train_bert_torch: Binary flag to specify whether BERT should be trained with PyTorch.
:param input_data: Path to input dataframe.
Options:
-s, --split
-p, --prepare_data
-tp, --text_preprocessing
-n, --newspaper TEXT
-t, --topic TEXT
-h, --hsprob TEXT
-pm, --pretrained_model TEXT
-tm, --train_mnb
-tb, --train_bert
-em, --eval_mnb
-eb, --eval_bert
-tbto, --train_bert_torch
```
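The `--hsprob` option restricts the training data to comments whose hate-speech probability lies in a given `[min, max]` range (the sbatch scripts further down train on `'[0.7,1]'` and `'[0.0,0.3]'` slices). A minimal sketch of such a range filter, assuming a hypothetical `hate_speech_probability` column (the column actually used by `load_text_csv` is not shown in this diff):
```
# Sketch of the --hsprob range filter; the column name
# "hate_speech_probability" is hypothetical.
import pandas as pd

def filter_hsprob(df: pd.DataFrame, hsprob) -> pd.DataFrame:
    if hsprob is None:
        return df
    lo, hi = hsprob
    return df[df["hate_speech_probability"].between(lo, hi)]
```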
The most important options during training are the model type (MNB or BERT) and the newspaper and topic selected for training.
### MNB
Training for all newspapers and topics is started with the following command:
```
moderation_classifier --train_mnb INPUT_DATA
```
Training for one newspaper (here: tagesanzeiger) and one topic (here: Wissen) is started with the following command:
```
moderation_classifier --newspaper tagesanzeiger --topic Wissen --train_mnb INPUT_DATA
```
After training finishes, a log file with all relevant information (path to the training data, filter parameters, ...) is stored in `saved_models/MNB_logs`. Only the path to this log file is needed for evaluation. The evaluation of the training run is started with:
```
moderation_classifier --eval_mnb LOG_FILE
```
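The classifier itself is built by `create_pipeline` in `src.MNB_utils`, which is not part of this diff. Below is a minimal sketch of a typical scikit-learn pipeline for this setup; the TF-IDF feature extraction step is an assumption.
```
# Sketch of a pipeline as create_pipeline() might build it;
# the TF-IDF feature extraction step is an assumption.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def create_pipeline() -> Pipeline:
    return Pipeline([
        ("tfidf", TfidfVectorizer()),  # token counts reweighted by TF-IDF
        ("clf", MultinomialNB()),      # Multinomial Naive Bayes classifier
    ])
```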
### BERT
Training for all newspapers and topics is started with the following command:
```
moderation_classifier --text_preprocessing --pretrained_model "bert-base-german-cased" --train_bert INPUT_DATA
```
Training for one newspaper (here: tagesanzeiger) and one topic (here: Wissen) is started with the following command:
```
moderation_classifier --text_preprocessing --pretrained_model "bert-base-german-cased" --newspaper tagesanzeiger --topic Wissen --train_bert INPUT_DATA
```
After training finishes, a log file with all relevant information (path to the training data, filter parameters, ...) is stored in `saved_models/BERT_logs`. Only the path to this log file is needed for evaluation. The evaluation of the training run is started with:
```
moderation_classifier --eval_bert LOG_FILE
```
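Both log files are written by `save_logs` and read back by `load_logs` from `src.train_logs`. A minimal sketch of that round trip, assuming a one-row CSV layout (the actual on-disk format is not shown in this diff; the field names follow the `load_logs` call in the BERT evaluation script below):
```
# Sketch of the train-log round trip; the CSV layout is an assumption,
# the field names follow the load_logs call in the BERT evaluation script.
import pandas as pd

FIELDS = [
    "path_repo", "path_model", "input_data", "text_preprocessing",
    "newspaper", "lang", "topic", "hsprob", "remove_duplicates",
    "min_num_words", "pretrained_model",
]

def save_logs_sketch(path_log, **params):
    pd.DataFrame([params])[FIELDS].to_csv(path_log, index=False)

def load_logs_sketch(path_log):
    row = pd.read_csv(path_log).iloc[0]
    return tuple(row[f] for f in FIELDS)
```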
#!/bin/bash
module load gcc/8.2.0 python_gpu/3.10.4 eth_proxy
source ../pp_env_tf_python310/bin/activate
sbatch --mem-per-cpu=4g --time=6:00:00 --wrap "moderation_classifier --prepare_data ../data/tamedia_for_classifier_v3.csv"
#!/bin/bash
module load gcc/8.2.0 python_gpu/3.10.4 eth_proxy
source ../pp_env_tf_python310/bin/activate
sbatch --mem-per-cpu=12g \
--gpus=1 \
--gres=gpumem:12g \
--time=30:00:00 \
--wrap "moderation_classifier --newspaper tagesanzeiger \
--pretrained_model 'bert-base-german-cased' \
--text_preprocessing \
--train_bert data/tamedia_for_classifier_v4_preproc_train.csv"
#!/bin/bash
module load gcc/8.2.0 python_gpu/3.10.4 eth_proxy
source ../pp_env_tf_python310/bin/activate
sbatch --mem-per-cpu=12g \
--gpus=1 \
--gres=gpumem:12g \
--time=30:00:00 \
--wrap "moderation_classifier --newspaper tagesanzeiger \
--pretrained_model 'bert-base-german-cased' \
--text_preprocessing \
--hsprob '[0.7,1]' \
--train_bert ../data/tamedia_for_classifier_v4_preproc_train.csv"
#!/bin/bash
module load gcc/8.2.0 python_gpu/3.10.4 eth_proxy
source ../pp_env_tf_python310/bin/activate
sbatch --mem-per-cpu=12g \
--gpus=1 \
--gres=gpumem:12g \
--time=30:00:00 \
--wrap "moderation_classifier --newspaper tagesanzeiger \
--pretrained_model 'bert-base-german-cased' \
--text_preprocessing \
--hsprob '[0.0,0.3]' \
--train_bert ../data/tamedia_for_classifier_v4_preproc_train.csv"
#!/bin/bash
module load gcc/8.2.0 python_gpu/3.10.4 eth_proxy
source ../pp_env_tf_python310/bin/activate
sbatch --mem-per-cpu=12g \
--gpus=1 \
--gres=gpumem:12g \
--time=30:00:00 \
--wrap "moderation_classifier --newspaper tagesanzeiger \
--pretrained_model 'bert-base-german-cased' \
--text_preprocessing \
--topic 'Wissen' \
--train_bert ../data/tamedia_for_classifier_v4_preproc_train.csv"
#!/bin/bash
module load gcc/8.2.0 python_gpu/3.10.4 eth_proxy
source ../pp_env_tf_python310/bin/activate
sbatch --mem-per-cpu=12g \
--gpus=1 \
--gres=gpumem:12g \
--time=30:00:00 \
--wrap "moderation_classifier --newspaper tagesanzeiger \
--pretrained_model 'deepset/bert-base-german-cased-hatespeech-GermEval18Coarse' \
--text_preprocessing \
--train_bert ../data/tamedia_for_classifier_v4_preproc_train.csv"
#!/bin/bash
module load gcc/8.2.0 python_gpu/3.10.4 eth_proxy
source ../pp_env_tf_python310/bin/activate
sbatch --mem-per-cpu=12g \
--gpus=1 \
--gres=gpumem:12g \
--time=30:00:00 \
--wrap "moderation_classifier --newspaper tagesanzeiger \
--pretrained_model 'deepset/bert-base-german-cased-hatespeech-GermEval18Coarse' \
--text_preprocessing \
--hsprob '[0.7,1]' \
--train_bert ../data/tamedia_for_classifier_v4_preproc_train.csv"
#!/bin/bash
module load gcc/8.2.0 python_gpu/3.10.4 eth_proxy
source ../pp_env_tf_python310/bin/activate
sbatch --mem-per-cpu=12g \
--gpus=1 \
--gres=gpumem:12g \
--time=30:00:00 \
--wrap "moderation_classifier --newspaper tagesanzeiger \
--pretrained_model 'deepset/bert-base-german-cased-hatespeech-GermEval18Coarse' \
--text_preprocessing \
--hsprob '[0.0,0.3]' \
--train_bert ../data/tamedia_for_classifier_v4_preproc_train.csv"
#!/bin/bash
module load gcc/8.2.0 python_gpu/3.10.4 eth_proxy
source ../pp_env_tf_python310/bin/activate
sbatch --mem-per-cpu=12g \
--gpus=1 \
--gres=gpumem:12g \
--time=30:00:00 \
--wrap "moderation_classifier --newspaper tagesanzeiger \
--pretrained_model 'deepset/bert-base-german-cased-hatespeech-GermEval18Coarse' \
--text_preprocessing \
--topic 'Wissen' \
--train_bert ../data/tamedia_for_classifier_v4_preproc_train.csv"
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
import tensorflow as tf
import click
import numpy as np
import os
import pandas as pd
from pathlib import Path
from typing import List, Union
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
from src.preprocessing_text import TextLoader, TextProcessor
from src.train_logs import load_logs
from src.BERT_utils import predict_batches
from src.eval_utils import gen_scores_dict
@click.argument("train_logs")
def main(train_logs: Union[str, os.PathLike]):
"""
Prepares data and evaluates trained BERT model with TF
:param train_logs: path to csv-file containing train logs
"""
# Load logs
(
path_repo,
path_model,
input_data,
text_preprocessing,
newspaper,
lang,
topic,
hsprob,
remove_duplicates,
min_num_words,
pretrained_model,
) = load_logs(train_logs)
# Load data and extract only text from tagesanzeiger
print("Load and preprocess text")
tl = TextLoader(input_data)
df_de = tl.load_text_csv(
newspaper=newspaper,
lang=lang,
topic=topic,
hsprob=hsprob,
load_subset=False,
remove_duplicates=remove_duplicates,
min_num_words=min_num_words,
)
if text_preprocessing:
tp = TextProcessor()
text_proc = tp.fit_transform(df_de.text)
df_de.text = text_proc
common_topics = tl.get_comments_per_topic(df_de)
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(pretrained_model)
model = TFAutoModelForSequenceClassification.from_pretrained(
pretrained_model_name_or_path=path_model
)
# Split text into batches
y_pred_all, y_prob_all = predict_batches(df_de.text.values, model, tokenizer)
# eval all
precision, recall, f1, _ = precision_recall_fscore_support(
df_de.label, y_pred_all, average="weighted"
)
accuracy = accuracy_score(df_de.label, y_pred_all)
results_all = gen_scores_dict(precision, recall, f1, accuracy)
# eval per topic
topics = [t[0] for t in common_topics]
results_t = dict()
for t in topics:
y_test_t = df_de[df_de.topic == t].label
y_pred_t = y_pred_all[df_de.topic == t]
precision, recall, f1, _ = precision_recall_fscore_support(
y_test_t, y_pred_t, average="weighted"
)
accuracy = accuracy_score(y_test_t, y_pred_t)
results_t[t] = gen_scores_dict(precision, recall, f1, accuracy)
# Compute rejection rate
reject_rate_all = np.round(df_de.label.mean(), 4) * 100
reject_rate_topic = [
np.round(df_de[df_de.topic == k].label.mean(), 4) * 100 for k in topics
]
# Compute number comments
num_comm_all = df_de.shape[0]
num_comm_topic = [df_de[df_de.topic == k].shape[0] for k in topics]
# Save results labels
df_res_all = pd.DataFrame().from_dict(results_all, orient="index", columns=["all"])
df_res_all.loc["rejection rate"] = reject_rate_all
df_res_all.loc["number comments"] = num_comm_all
df_res_topic = pd.DataFrame.from_dict(results_t)
df_res_topic.loc["rejection rate"] = reject_rate_topic
df_res_topic.loc["number comments"] = num_comm_topic
df_res = df_res_all.join(df_res_topic)
df_res.loc["data"] = [input_data] * df_res.shape[1]
df_res.to_csv(
path_repo + "/results/results_eval_BERT/" + Path(path_model).stem + ".csv"
)
# Save results probs
df_prob_all = df_de.copy()
df_prob_all['bert_probability'] = y_prob_all
df_prob_all.to_csv(
path_repo + "/results/results_eval_BERT/" + Path(path_model).stem + "_bert_probability.csv"
)
if __name__ == "__main__":
main()
import click
from collections import Counter
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import precision_recall_fscore_support
from typing import Union
import os
from src.MNB_utils import load_model
from src.preprocessing_text import TextLoader
from src.train_logs import load_logs
from src.eval_utils import gen_scores_dict
@click.argument("train_logs")
def main(train_logs: Union[str, os.PathLike]):
"""
Prepares data and evaluates trained MNB model
:param train_logs: path to csv-file containing train logs
"""
# Load logs
(
path_repo,
path_model,
input_data,
_,
newspaper,
lang,
topic,
remove_duplicates,
min_num_words,
) = load_logs(train_logs)
# Load model
pipe = load_model(path_model)
# Load test data
tl = TextLoader(input_data)
df_test = tl.load_text_csv(
newspaper=newspaper,
lang=lang,
topic=topic,
load_subset=False,
remove_duplicates=remove_duplicates,
min_num_words=min_num_words,
)
X_test = df_test.text
y_test = df_test.label
# Make prediction
y_pred = pipe.predict(X_test)
# Compute scores and add to dict
precision, recall, f1, _ = precision_recall_fscore_support(
y_test, y_pred, average="weighted"
)
accuracy = pipe.score(X_test, y_test)
results_all = gen_scores_dict(precision, recall, f1, accuracy)
# Get results per topic
count_topics = Counter(df_test["topic"]).most_common(10)
topics = [t[0] for t in count_topics]
results_t = dict()
for t in topics:
X_test_t = df_test[df_test.topic == t].text
y_test_t = df_test[df_test.topic == t].label
y_pred_t = pipe.predict(X_test_t)
precision, recall, f1, _ = precision_recall_fscore_support(
y_test_t, y_pred_t, average="weighted"
)
accuracy = pipe.score(X_test_t, y_test_t)
results_t[t] = gen_scores_dict(precision, recall, f1, accuracy)
# Compute rejection rate
reject_rate_all = np.round(df_test.label.mean(), 4) * 100
reject_rate_topic = [
np.round(df_test[df_test.topic == k].label.mean(), 4) * 100 for k in topics
]
# Compute number comments
num_comm_all = df_test.shape[0]
num_comm_topic = [df_test[df_test.topic == k].shape[0] for k in topics]
# Save results
df_res_all = pd.DataFrame().from_dict(results_all, orient="index", columns=["all"])
df_res_all.loc["rejection rate"] = reject_rate_all
df_res_all.loc["number comments"] = num_comm_all
df_res_topic = pd.DataFrame.from_dict(results_t)
df_res_topic.loc["rejection rate"] = reject_rate_topic
df_res_topic.loc["number comments"] = num_comm_topic
df_res = df_res_all.join(df_res_topic)
df_res.loc["data"] = [input_data] * df_res.shape[1]
df_res.to_csv(
path_repo + "/results/results_eval_MNB/" + Path(path_model).stem + ".csv"
)
if __name__ == "__main__":
main()
# imports
from pathlib import Path
import click
from src.preprocessing_df import DataProcessor
import moderation_classifier.split_data as split_data
import moderation_classifier.train_MNB as train_MNB
import moderation_classifier.train_BERT as train_BERT
import moderation_classifier.eval_MNB as eval_MNB
import moderation_classifier.eval_BERT as eval_BERT
import moderation_classifier.train_BERT_torch as train_BERT_torch
from typing import Union
import ast
import os
@click.command()
@click.option("-s", "--split", is_flag=True)
@click.option("-p", "--prepare_data", is_flag=True)
@click.option("-tp", "--text_preprocessing", is_flag=True)
@click.option("-n", "--newspaper", default=None)
@click.option("-t", "--topic", default=None)
@click.option("-h", "--hsprob", default=None)
@click.option("-pm", "--pretrained_model", default=None)
@click.option("-tm", "--train_mnb", is_flag=True)
@click.option("-tb", "--train_bert", is_flag=True)
@click.option("-em", "--eval_mnb", is_flag=True)
@click.option("-eb", "--eval_bert", is_flag=True)
@click.option("-tbto", "--train_bert_torch", is_flag=True)
@click.argument("input_data")
def main(
split: bool,
prepare_data: bool,
text_preprocessing: bool,
newspaper: str,
topic: str,
hsprob: list,
pretrained_model: str,
train_mnb: bool,
train_bert: bool,
eval_mnb: bool,
eval_bert: bool,
train_bert_torch: bool,
input_data: Union[str, os.PathLike],
):
"""
Run moderation classifier.
:param split: Binary flag to specify if data should be split.
:param prepare_data: Binary flag to specify if data should be prepared.
:param text_preprocessing: Binary flag to set text preprocessing.
:param newspaper: Name of newspaper selected for training.
:param topic: Topic selected for training.
:param hsprob: List with min/max values for hate speech probability.
:param pretrained_model: Name of pretrained BERT model to use for finetuning.
:param train_mnb: Binary flag to specify whether MNB should be trained.
:param train_bert: Binary flag to specify whether BERT should be trained.
:param eval_mnb: Binary flag to specify whether MNB should be evaluated.
:param eval_bert: Binary flag to specify whether BERT should be evaluated.
:param train_bert_torch: Binary flag to specify whether BERT should be trained with PyTorch.
:param input_data: Path to input dataframe.
"""
if split:
split_data.main(input_data)
if prepare_data:
dp = DataProcessor(input_data)
dp.add_language()
print("Prepare data")
if train_mnb:
train_MNB.main(input_data, newspaper, topic)
if train_bert:
if hsprob is not None:
hsprob = ast.literal_eval(hsprob)  # safely parse the "[min,max]" string
train_BERT.main(
input_data, text_preprocessing, newspaper, topic, hsprob, pretrained_model
)
if eval_mnb:
eval_MNB.main(input_data)
if eval_bert:
eval_BERT.main(input_data)
if train_bert_torch:
train_BERT_torch.main(input_data)
if __name__ == "__main__":
main()
from datasets import load_dataset
from evaluate import evaluator
from transformers import pipeline
data = load_dataset("imdb", split="test").shuffle(seed=42).select(range(10))
task_evaluator = evaluator("text-classification")
pipe = pipeline("text-classification", model="../saved_models/20230630-103946/")
eval_results = task_evaluator.compute(
model_or_pipeline=pipe,
data=data,
label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
)
print(eval_results)
import os
import pandas as pd
from pathlib import Path
from typing import Union
from sklearn.model_selection import train_test_split
def main(input_data: Union[str, os.PathLike]):
"""
Performs a train-test split stratified by newspaper (originTenantId)
:param input_data: Path to input dataframe.
"""
df = pd.read_csv(input_data)
df_train, df_test = train_test_split(df, test_size=0.3, stratify=df.originTenantId)
path_train = (
Path(input_data)
.parent.joinpath(Path(input_data).stem + "_train")
.with_suffix(".csv")
)
path_test = (
Path(input_data)
.parent.joinpath(Path(input_data).stem + "_test")
.with_suffix(".csv")
)
df_train.to_csv(path_train)
df_test.to_csv(path_test)
if __name__ == "__main__":
main()
from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding
from transformers import TFAutoModelForSequenceClassification
from transformers.keras_callbacks import KerasMetricCallback
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.callbacks import TensorBoard
import click
import datetime
import os
import pandas as pd
from pathlib import Path
import spacy
from typing import Union
from src.preprocessing_text import TextLoader, TextProcessor
from src.prepare_bert_tf import df2dict, compute_metrics, prepare_training
from src.train_logs import save_logs
@click.argument("input_data", required=True)
@click.argument("text_preprocessing", required=False)
@click.argument("newspaper", required=False)
@click.argument("topic", required=False)
@click.argument("pretrained_model", required=True)
def main(
input_data: Union[str, os.PathLike],
text_preprocessing: bool,
newspaper: str,
topic: str,
hsprob: list,
pretrained_model: str,
):
"""
Prepares data and trains BERT model with TF
:param input_data: path to input data
:param text_preprocessing: Binary flag to set text preprocessing.
:param newspaper: Name of newspaper selected for training.
:param topic: Topic selected for training.
:param hsprob: List with min max values for hate speech probability
:param pretrained_model: Name of pretrained BERT model to use for finetuning.
"""
def preprocess_function(examples):
"""
Prepares tokenizer for mapping
"""
return tokenizer(examples["text"], truncation=True)
# Extract path
p = Path(input_data)
p_repo = p.parent.parent
# Load data and extract only text from tagesanzeiger
print("Load and preprocess text")
lang = "de"
remove_duplicates = True
min_num_words = 3
tl = TextLoader(input_data)
df_de = tl.load_text_csv(
newspaper=newspaper,
lang=lang,
topic=topic,
hsprob=hsprob,
load_subset=False,
remove_duplicates=remove_duplicates,
min_num_words=min_num_words,
)
if text_preprocessing:
tp = TextProcessor(lowercase=False)
text_proc = tp.fit_transform(df_de.text)
df_de.text = text_proc
#df_de = df_de.sample(100)
# Prepare data for modeling
ds = df2dict(df_de)
# pretrained_model = "bert-base-german-cased"
tokenizer = AutoTokenizer.from_pretrained(pretrained_model)
tokenized_text = ds.map(preprocess_function)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
# Training
print("Train model")
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}
optimizer, _ = prepare_training(tokenized_text)
model = TFAutoModelForSequenceClassification.from_pretrained(
pretrained_model, num_labels=2, id2label=id2label, label2id=label2id
)
tf_train_set = model.prepare_tf_dataset(
tokenized_text["train"],
shuffle=True,
batch_size=16,
collate_fn=data_collator,
)
tf_validation_set = model.prepare_tf_dataset(
tokenized_text["test"],
shuffle=False,
batch_size=16,
collate_fn=data_collator,
)
model.compile(optimizer=optimizer)
# Define checkpoint
time_stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
path_checkpoint = (p_repo).joinpath("tmp/checkpoint/" + time_stamp)
checkpoint_filepath = path_checkpoint
metric_callback = KerasMetricCallback(
metric_fn=compute_metrics, eval_dataset=tf_validation_set
)
checkpoint_callback = ModelCheckpoint(
checkpoint_filepath,
monitor="val_loss",
save_best_only=True,
save_weights_only=False,
mode="min",
save_freq="epoch",
initial_value_threshold=None,
)
log_dir = "logs/fit/" + time_stamp
tensorboard_callback = TensorBoard(log_dir=log_dir, histogram_freq=1)
callbacks = [metric_callback, checkpoint_callback, tensorboard_callback]
# Fit model
print("Train model")
model.fit(
x=tf_train_set,
validation_data=tf_validation_set,
epochs=5,
verbose=2,
callbacks=callbacks,
)
# Save model
print("Save model")
path_model = (p_repo).joinpath("saved_models/" + time_stamp)
model.save_pretrained(path_model)
tokenizer.save_pretrained(path_model)
# Save model logs
save_logs(
path_repo=p_repo,
path_model=path_model,
input_data=input_data,
text_preprocessing=text_preprocessing,
newspaper=newspaper,
lang=lang,
topic=topic,
hsprob=hsprob,
remove_duplicates=remove_duplicates,
min_num_words=min_num_words,
model_name="BERT",
pretrained_model=pretrained_model,
)
print("Done")
if __name__ == "__main__":
main()
from datasets import Dataset, DatasetDict
import evaluate
from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from transformers import create_optimizer
import numpy as np
import pandas as pd
from typing import Union
import os
import click
from sklearn.model_selection import train_test_split
from src.preprocessing_text import TextLoader
def load_text(
path: Union[str, os.PathLike], newspaper: str = "tagesanzeiger", lang: str = "de"
) -> pd.DataFrame:
"""
Loads dataframe and extracts text depending on newspaper and language
"""
df = pd.read_csv(path)
df = df.loc[(df.originTenantId == newspaper) & (df.language == lang)]
df = df[["text", "rejected"]]
df = df.rename(columns={"rejected": "label"})
return df
def df2dict(df: pd.DataFrame):
"""
Converts Dataframe into Huggingface Dataset
"""
df = df.sample(10000)  # work on a fixed-size subsample
train, test = train_test_split(df, test_size=0.2)
ds_train = Dataset.from_pandas(train)
ds_test = Dataset.from_pandas(test)
ds = DatasetDict()
ds["train"] = ds_train
ds["test"] = ds_test
return ds
def compute_metrics(eval_pred):
accuracy = evaluate.load("accuracy")
predictions, labels = eval_pred
predictions = np.argmax(predictions, axis=1)
return accuracy.compute(predictions=predictions, references=labels)
def prepare_training(dataset, batch_size: int = 16, num_epochs: int = 5):
"""
Prepares training and sets params
"""
batches_per_epoch = len(dataset["train"]) // batch_size
total_train_steps = int(batches_per_epoch * num_epochs)
optimizer, schedule = create_optimizer(
init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps
)
return optimizer, schedule
@click.argument("input_data")
def main(input_data: Union[str, os.PathLike]):
# load data and extract only german text from tagesanzeiger
print("Load text")
tl = TextLoader(input_data)
df_de = tl.load_text_csv(newspaper="tagesanzeiger", load_subset=True)
# Dataframe to dict/Train-test split
ds = df2dict(df_de)
# Preprocessing/Tokenization
print("tokenize")
tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
def preprocess_function(examples):
return tokenizer(examples["text"], truncation=True)
# truncate sequences to be no longer than the model's maximum input length
print("map")
tokenized_text = ds.map(preprocess_function)
# dynamic padding of sentences to the longest sequence in a batch
# data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
# Training
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}
model = AutoModelForSequenceClassification.from_pretrained(
"bert-base-german-cased", num_labels=2, id2label=id2label, label2id=label2id
)
training_args = TrainingArguments(
output_dir="my_awesome_model",
learning_rate=2e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
num_train_epochs=2,
weight_decay=0.01,
evaluation_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
push_to_hub=False,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_text["train"],
eval_dataset=tokenized_text["test"],
tokenizer=tokenizer,
data_collator=data_collator,
compute_metrics=compute_metrics,
)
trainer.train()
if __name__ == "__main__":
main()
from sklearn.model_selection import train_test_split
import click
from pathlib import Path
from typing import Union
import os
from src.MNB_utils import create_pipeline, create_path, save_model
from src.preprocessing_text import TextLoader
from src.train_logs import save_logs
@click.argument("input_data")
@click.argument("newspaper")
@click.argument("topic")
def main(input_data: Union[str, os.PathLike], newspaper: str, topic: str):
"""
Runs training of MNB.
:param input_data: Path to input dataframe.
:param newspaper: Name of newspaper selected for training.
:param topic: Topic selected for training.
"""
# Extract path
p = Path(input_data)
p_repo = p.parent.parent
# Load data and extract only text from tagesanzeiger
print("Load and preprocess text")
lang = "de"
remove_duplicates = True
min_num_words = 3
tl = TextLoader(input_data)
df_de = tl.load_text_csv(
newspaper=newspaper,
lang=lang,
topic=topic,
load_subset=False,
remove_duplicates=remove_duplicates,
min_num_words=min_num_words,
)
# Prepare data for modeling
text = df_de.text
label = df_de.label
X_train, X_val, y_train, y_val = train_test_split(text, label, stratify=label)
# Training
print("Train model")
pipe = create_pipeline()
pipe.fit(X_train, y_train)
val_score = pipe.score(X_val, y_val)
# Save model and training logs
path = create_path()
save_model(pipe, path)
save_logs(
path_repo=p_repo,
path_model=path,
input_data=input_data,
text_preprocessing=True,
newspaper=newspaper,
lang=lang,
topic=topic,
remove_duplicates=remove_duplicates,
min_num_words=min_num_words,
model_name="MNB",
val_score=val_score,
)
if __name__ == "__main__":
main()