About

LINSPECTOR (Language Inspector) is an open-source multilingual inspector for analyzing word representations. Our goal is to provide an easily accessible tool for gaining quick insights into your word embeddings, especially for languages other than English. To do this, we employ simple classification tasks, called probing tasks, for a diverse set of languages.

Multilingual Static Probing Tasks

Probing tasks, also known as diagnostic classifiers, take as input a representation generated by a fully trained neural model and output predictions for a single linguistic feature of interest such as case marking, gender, person, or tense. Şahin et al. (2019) introduced multilingual probing tasks that are compatible with the linguistic properties of the selected language. Some examples of probing tasks are provided below:

Morphosyntactic Tasks

Case Marking
Arabic اِنْكِشَافٍ Genitive
Finnish tehottomaksi Translative
German Nebensatzes Genitive
Turkish kapıda Locative

Available for: Arabic, Armenian, Bulgarian, Czech, Estonian, Finnish, German, Greek, Hungarian, Macedonian, Polish, Quechuan, Russian, Serbo-Croatian, Swedish, Turkish.

Gender
Bulgarian златната Feminine
Spanish personalizado Masculine

Available for: Arabic, Bulgarian, Greek, Macedonian, Polish, Portuguese, Russian, Serbo-Croatian, Spanish.

Person
Catalan pensen Third person plural
French présentèrent Third person plural
Italian incamminò Third person singular

Available for: Arabic, Armenian, Catalan, Finnish, French, German, Greek, Hungarian, Italian, Macedonian, Polish, Portuguese, Quechuan, Romanian, Russian, Serbo-Croatian, Spanish, Turkish.

Tense
German umgesetzt Past
Hungarian méltányolandó Future

Available for: Armenian, Bulgarian, Catalan, Finnish, French, German, Greek, Hungarian, Italian, Macedonian, Polish, Portuguese, Quechuan, Romanian, Russian, Serbo-Croatian, Spanish, Turkish.

Odd and Shared Morphological Feature

Odd Morphological Feature and Shared Morphological Feature are contrastive tasks: given two tokens, the classifier predicts the single morphological feature in which they differ (odd) or the single feature they have in common (shared).

A single shared morphological feature. Şahin et al. (2019), Table 3.
Language | Form 1 | Form 2 | Shared
German | Stofftiere 'stuffed animal' (N.NOM.PL) | Tennisplatz 'tennis court' (N.NOM.SG) | Case
Russian | pantera 'panther' (N.NOM.SG) | optimisticheskogo 'optimistic' (ADJ.GEN.SG) | Number
Turkish | sarımsaklarım 'garlic' (N.PL.POSS1SG.NOM) | cümlemde 'sentence' (N.SG.POSS1SG.LOC) | Polarity

Available for: Arabic, Armenian, Bulgarian, Catalan, Czech, Danish, Dutch, Estonian, Finnish, French, German, Greek, Italian, Macedonian, Polish, Portuguese, Quechuan, Romanian, Russian, Serbo-Croatian, Spanish, Swedish, Turkish.

A single odd morphological feature. Şahin et al. (2019), Table 4.
Language | Form 1 | Form 2 | Odd
German | integriert 'integrate' (V.3SG.IND.PRS) | rechnet 'count' (V.3SG.IND.PRS) | Lemma
Spanish | legalisada 'legalized' (V.SG.PTCP.F) | legalisado 'legalized' (V.SG.PTCP.M) | Gender
Turkish | seçenekler 'option' (N.NOM.PL) | seçeneklere 'option' (N.DAT.PL) | Case

Available for: Armenian, Czech, Finnish, German, Greek, Hungarian, Macedonian, Polish, Portuguese, Quechuan, Romanian, Russian, Serbo-Croatian, Spanish, Swedish, Turkish.

Pseudoword

A pseudoword is an artificial word that has no particular meaning but sounds like a word from the given language. The classifier tries to predict whether a given token is a pseudoword or a real word with an attached meaning.

Pseudowords. Şahin et al. (2019), Table 2.
Language Pseudoword
Dutch nerstbare, openkialig, inwrannees, tedenjaaigige, wuitje
English atlinsive, delilottent, foiry
French souvuille, faicha, blêlament
German Anstiffung, hefumtechen, Schlauben, Scheckmal, spüßten

Available for: Basque, Dutch, English, French, German, Serbian, Spanish, Turkish, Vietnamese.

Multilingual Contextual Probing Tasks

LINSPECTOR also supports the multilingual contextual probing tasks by Şahin et al. (2019). Each instance of a contextual probing task consists of a sentence, a label, and a zero-based index that identifies the word in the sentence to which the label applies. A few examples are shown below.

Case Marking
Language Sentence Index Label
Basque Oraindik ez du ziurtatuta txartela . 4 Absolutive
Finnish Oli kohdannut oikean kirjan ja oikean kohdan . 5 Genitive
German Von mir gibt es deshalb 5 Sterne ! 6 Nominative
Turkish Çok kıymetli arkadaşlardır . 2 Nominative

Available for: Ancient Greek, Arabic, Basque, Croatian, Czech, English, Estonian, Finnish, German, Gothic, Greek, Hindi, Hungarian, Latin, Latvian, Norwegian, Old Church Slavonic, Polish, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish, Turkish, Ukrainian, Urdu.
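To make the instance format concrete, here is a minimal Python sketch of how a contextual probing instance could be represented. The class name ContextualInstance and its target() method are hypothetical and not part of LINSPECTOR. The sketch also assumes that sentences are pre-tokenized and split on whitespace, as the examples above suggest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextualInstance:
    """One contextual probing instance (hypothetical helper, not LINSPECTOR code)."""

    sentence: str  # pre-tokenized sentence, tokens separated by spaces (assumption)
    index: int     # zero-based position of the labeled word
    label: str     # linguistic label of that word, e.g. "Absolutive"

    def target(self) -> str:
        """Return the word that the label applies to."""
        tokens = self.sentence.split()
        return tokens[self.index]


# Two rows from the Case Marking examples above.
basque = ContextualInstance("Oraindik ez du ziurtatuta txartela .", 4, "Absolutive")
german = ContextualInstance("Von mir gibt es deshalb 5 Sterne !", 6, "Nominative")
print(basque.target(), german.target())
```

Because the index counts from 0, index 4 in the Basque sentence selects the fifth token.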

Gender
Language Sentence Index Label
Bulgarian Откъде черпите тези данни ? 3 Feminine
German Der Sendebetrieb wurde im September 2010 aufgenommen . 1 Masculine
Spanish Ése es el lado bueno de su gestión . 2 Masculine

Available for: Ancient Greek, Arabic, Bulgarian, Catalan, Croatian, Czech, Danish, Dutch, French, German, Gothic, Greek, Hebrew, Hindi, Italian, Latin, Latvian, Norwegian, Old Church Slavonic, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish, Ukrainian, Urdu.

Person
Language Sentence Index Label
Catalan És un atac directe contra vostès ? 5 Second person plural
French Il faisait partie du district de Laon . 1 Third person singular
Italian In quanti hanno risposto ? 2 Third person plural

Available for: Ancient Greek, Arabic, Bulgarian, Catalan, Croatian, Czech, English, Estonian, Finnish, French, German, Gothic, Hebrew, Hindi, Italian, Latin, Norwegian, Old Church Slavonic, Persian, Polish, Portuguese, Romanian, Russian, Slovenian, Spanish, Turkish, Urdu.

Tense
Language Sentence Index Label
Hindi हमने उनकी बात सुनी है । 4 Present
Swedish Vi vet också att han skrev böcker . 5 Past

Available for: Ancient Greek, Bulgarian, Catalan, Croatian, Czech, Dutch, English, Estonian, Finnish, French, German, Gothic, Hindi, Italian, Latin, Latvian, Norwegian, Old Church Slavonic, Persian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, Turkish.

AllenNLP

To probe AllenNLP models, LINSPECTOR requires a model.tar.gz archive, which AllenNLP generates automatically in the serialization directory.

During probing, a set of tokens is passed through the model to extract word embeddings. These embeddings then serve as the embedding layer of a simple linear classifier, which is trained and evaluated on the intrinsic data provided by Şahin et al. (2019).
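The idea of training a linear classifier on top of fixed embeddings can be illustrated with a small, dependency-free Python sketch. It is not LINSPECTOR's implementation: a multiclass perceptron stands in for the unspecified "simple linear classifier", the function names are hypothetical, and the two-dimensional vectors are invented toy values. A real probe would also evaluate on held-out data rather than on the training examples.

```python
def predict(weights, features):
    """Return the label whose weight vector scores highest for these features."""
    def score(label):
        return sum(w * x for w, x in zip(weights[label], features))
    return max(sorted(weights), key=score)


def train_linear_probe(embeddings, examples, epochs=20):
    """Train a multiclass perceptron; the embeddings themselves stay frozen.

    embeddings: dict mapping token -> list of floats
    examples:   list of (token, label) pairs
    Returns a dict mapping label -> weight list (last entry is the bias).
    """
    labels = sorted({label for _, label in examples})
    dim = len(next(iter(embeddings.values())))
    weights = {label: [0.0] * (dim + 1) for label in labels}
    for _ in range(epochs):
        for token, gold in examples:
            features = embeddings[token] + [1.0]  # constant 1.0 acts as bias input
            guess = predict(weights, features)
            if guess != gold:
                # Perceptron update: move toward gold, away from the wrong guess.
                for i, x in enumerate(features):
                    weights[gold][i] += x
                    weights[guess][i] -= x
    return weights


def accuracy(weights, embeddings, examples):
    """Fraction of examples whose predicted label equals the gold label."""
    hits = sum(predict(weights, embeddings[t] + [1.0]) == y for t, y in examples)
    return hits / len(examples)


# Invented toy vectors for a case-marking probe (Turkish locative vs. genitive).
toy_embeddings = {
    "kapıda": [1.0, 0.0], "evde": [0.9, 0.1],
    "kapının": [0.0, 1.0], "evin": [0.1, 0.9],
}
toy_examples = [("kapıda", "Locative"), ("evde", "Locative"),
                ("kapının", "Genitive"), ("evin", "Genitive")]
probe = train_linear_probe(toy_embeddings, toy_examples)
print(accuracy(probe, toy_embeddings, toy_examples))
```

The accuracy of such a probe is read as an indication of how much information about the feature (here, case) the embeddings encode.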

Currently we support a limited number of vanilla AllenNLP models listed on the landing page.

HuggingFace's Transformers

To probe HuggingFace's Transformers models, LINSPECTOR requires a .zip archive that contains both the model and the tokenizer. These files can be saved with the model.save_pretrained('directory') and tokenizer.save_pretrained('directory') methods.
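One possible way to produce such an archive is sketched below in Python. The helper name export_for_linspector is hypothetical. The sketch also assumes that the saved files should sit at the top level of the archive, which this page does not specify. The helper accepts any objects that expose save_pretrained(directory), as Transformers models and tokenizers do.

```python
import shutil
import tempfile
from pathlib import Path


def export_for_linspector(model, tokenizer, archive_path):
    """Save model and tokenizer into one directory and zip its contents.

    Hypothetical helper. Returns the path of the created .zip file.
    """
    # make_archive appends ".zip" itself, so strip the suffix first.
    base = Path(archive_path).resolve().with_suffix("")
    with tempfile.TemporaryDirectory() as tmp:
        model.save_pretrained(tmp)
        tokenizer.save_pretrained(tmp)
        # Assumption: files are placed at the archive root, not in a subfolder.
        return shutil.make_archive(str(base), "zip", root_dir=tmp)


# Usage with real Transformers objects (not executed here; the model name is
# only an example and may not be among the models LINSPECTOR supports):
#   from transformers import AutoModel, AutoTokenizer
#   name = "bert-base-multilingual-cased"
#   export_for_linspector(AutoModel.from_pretrained(name),
#                         AutoTokenizer.from_pretrained(name),
#                         "my_model.zip")
```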

LINSPECTOR feeds sentences to the model and extracts the embeddings of individual words.

The currently supported HuggingFace Transformers models are listed on the landing page.

Static Embeddings

In a static embeddings file, each line should contain a token followed by its vector, with all fields separated by whitespace.

Example:

katapultiert 0.019 -0.29 -0.34 0.076 ...
rutsch 0.019 -0.29 -0.34 0.076 ...

All tokens are lowercased, and the vectors are used as an embedding layer, as with AllenNLP models.

The required file extension is .vec.
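The file format described above can be read with a few lines of Python. This is only a sketch of the format, not LINSPECTOR's loader. The function name parse_vec_lines is hypothetical, and choices such as skipping blank lines and letting a later duplicate overwrite an earlier one are assumptions. It does not treat a header line specially.

```python
def parse_vec_lines(lines):
    """Parse 'token v1 v2 ...' lines into a dict of lowercased token -> vector."""
    embeddings = {}
    for line in lines:
        parts = line.split()  # split on any run of whitespace
        if len(parts) < 2:
            continue  # assumption: ignore blank or vector-less lines
        token = parts[0].lower()
        # Assumption: if two tokens collide after lowercasing, the later wins.
        embeddings[token] = [float(value) for value in parts[1:]]
    return embeddings


# The example tokens from above, truncated to four dimensions for illustration.
sample = [
    "katapultiert 0.019 -0.29 -0.34 0.076",
    "Rutsch 0.019 -0.29 -0.34 0.076",
]
vectors = parse_vec_lines(sample)
print(sorted(vectors))
```

Since the function accepts any iterable of lines, it can also be given an open file object, for example one opened with open("embeddings.vec", encoding="utf-8").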

Contributions

LINSPECTOR was originally developed in 2019 by Max Eichler as part of his Bachelor's thesis, under the supervision of Gözde Gül Şahin. Contextual probing was added by Marvin Kaster.