GPT-Sw3

Overview

The GPT-Sw3 model was first proposed in Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren.

Since that first paper the authors have extended their work and trained new models on their new 1.2TB corpora named The Nordic Pile.

GPT-Sw3 is a collection of large decoder-only pretrained transformer language models that were developed by AI Sweden in collaboration with RISE and the WASP WARA for Media and Language. GPT-Sw3 has been trained on a dataset containing 320B tokens in Swedish, Norwegian, Danish, Icelandic, English, and programming code. The model was pretrained using a causal language modeling (CLM) objective utilizing the NeMo Megatron GPT implementation.

This model was contributed by AI Sweden Models.

Usage example

>>> from transformers import AutoTokenizer, AutoModelForCausalLM

>>> tokenizer = AutoTokenizer.from_pretrained("AI-Sweden-Models/gpt-sw3-356m")
>>> model = AutoModelForCausalLM.from_pretrained("AI-Sweden-Models/gpt-sw3-356m")

>>> input_ids = tokenizer("Träd är fina för att", return_tensors="pt")["input_ids"]

>>> generated_token_ids = model.generate(inputs=input_ids, max_new_tokens=10, do_sample=True)[0]

>>> print(tokenizer.decode(generated_token_ids))
Träd är fina för att de är färgstarka. Men ibland är det fint

Resources

The implementation uses the GPT2Model coupled with our GPTSw3Tokenizer. Refer to GPT2Model documentation for API reference and examples.

Note that sentencepiece is required to use our tokenizer and can be installed with pip install transformers[sentencepiece] or pip install sentencepiece

GPTSw3Tokenizer

class transformers.GPTSw3Tokenizer

< >

( vocab_file do_lower_case = False remove_space = False keep_accents = False pad_token = None unk_token = None eos_token = None bos_token = None sp_model_kwargs: Optional = None **kwargs )

Parameters

  • vocab_file (str) — SentencePiece file (generally has a .spm extension) that contains the vocabulary necessary to instantiate a tokenizer.
  • do_lower_case (bool, optional, defaults to False) — Whether or not to lowercase the input when tokenizing.
  • remove_space (bool, optional, defaults to False) — Whether or not to strip the text when tokenizing (removing excess spaces before and after the string).
  • keep_accents (bool, optional, defaults to False) — Whether or not to keep accents when tokenizing.
  • pad_token (str, optional) — The token used for padding, for example when batching sequences of different lengths. If not provided, will default to ’’ or ’’ depending on model size.
  • unk_token (str, optional) — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead. If not provided, will default to ’‘.
  • eos_token (str, optional) — The end of sequence token seen during pretraining. If not provided, will default to ’<|endoftext|>’
  • bos_token (str, optional) — The beginning of sequence token that can be used for downstream task, was not seen during pretraining. If not provided, will default to ’’ or ’<|endoftext|>’, depending on model size.
  • sp_model_kwargs (dict, optional) — Will be passed to the SentencePieceProcessor.__init__() method. The Python wrapper for SentencePiece can be used, among other things, to set:

    • enable_sampling: Enable subword regularization.

    • nbest_size: Sampling parameters for unigram. Invalid for BPE-Dropout.

      • nbest_size = {0,1}: No sampling is performed.
      • nbest_size > 1: samples from the nbest_size results.
      • nbest_size < 0: assuming that nbest_size is infinite and samples from the all hypothesis (lattice) using forward-filtering-and-backward-sampling algorithm.
    • alpha: Smoothing parameter for unigram sampling, and dropout probability of merge operations for BPE-dropout.

  • sp_model (SentencePieceProcessor) — The SentencePiece processor that is used for every conversion (string, tokens and IDs).
  • whitespaces (set) — The whitespaces that are replaced in the whitespace normalization in preprocessing.
  • non_printing_characters_re (Pattern) — The compiled regular expression to remove non-printing characters in preprocessing.

Construct an GPTSw3 tokenizer. Based on SentencePiece.

This tokenizer inherits from PreTrainedTokenizer which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

Example usage:

>>> from transformers import GPTSw3Tokenizer

>>> tokenizer = GPTSw3Tokenizer.from_pretrained("AI-Sweden-Models/gpt-sw3-126m")
>>> tokenizer("Svenska är kul!")["input_ids"]
[1814, 377, 3617, 63504]

save_vocabulary

< >

( save_directory: str filename_prefix: Optional = None )