1. Introduction
Authorship attribution – the task of identifying the author of a given document based on a set of possible candidates (Potthast et al. 2017) – is present in a wide range of text forensics and related applications. These include, for instance, corruption investigation models (Chen et al. 2011; Juola and Stamatatos 2013), on-line abuse (Vartapetiance and Gillam 2012), fake news (Peng, Choo, and Ashman 2016) and false impersonation detection (Koppel and Seidman 2018), among many others.
Studies in the field usually draw a distinction between closed-set and open-set settings (Kestemont et al. 2019). In closed-set authorship attribution, the target author of the input document is assumed to be a member of the set of possible candidates and, as a result, the task consists of selecting the correct candidate among the alternatives provided. In open-set authorship attribution, by contrast, the target author may or may not be found within the candidate set, and therefore the task involves determining whether the author is unknown as well. In what follows, we shall focus on closed-set authorship attribution only.
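The difference between the two settings can be illustrated with a minimal sketch. The function names, the example scores and the use of a fixed confidence threshold to signal an unknown author are all assumptions made for illustration; they are not taken from any of the cited systems.

```python
def attribute_closed_set(scores):
    """Closed-set: the author is assumed to be a candidate, so always pick the best one."""
    return max(scores, key=scores.get)


def attribute_open_set(scores, threshold=0.5):
    """Open-set: return the best candidate, or None (unknown) if its score is too low.

    The threshold rule is a hypothetical decision criterion used only for illustration.
    """
    best = attribute_closed_set(scores)
    return best if scores[best] >= threshold else None


# Hypothetical per-candidate scores produced by some attribution model.
scores = {"author_a": 0.40, "author_b": 0.35, "author_c": 0.25}
closed = attribute_closed_set(scores)  # always names a candidate
opened = attribute_open_set(scores)    # may abstain
```

With these example scores, the closed-set function names `author_a`, whereas the open-set function abstains because no candidate reaches the assumed threshold.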
Authorship attribution has been a popular research topic in Natural Language Processing and the subject of several shared tasks in the PAN-CLEF authorship attribution series (Potthast et al. 2017; Kestemont et al. 2018, 2019). Closed-set authorship attribution, in particular, is usually modelled as a supervised learning task, making use of text corpora labelled with author identifiers representing the classes (or authors) to be identified. Popular methods include the use of support vector machine classifiers (Schwartz et al. 2013; Stamatatos 2017), recurrent neural networks (Bagnall 2016; Jafariakinabad and Hua 2019), convolutional neural networks (Sari and Stevenson 2016; Shrestha et al. 2017; Misra et al. 2019) and stacks of ensemble classifiers (Custódio and Paraboni 2019), as we shall discuss later.
Despite advances in recent years, authorship attribution continues to attract interest as a research problem (Kestemont et al. 2019). At the same time, significant advances have also been observed in the related field of author profiling, that is, the computational task of learning author demographics from text (Silva and Paraboni 2018; Rangel and Rosso 2019). As in the case of closed-set authorship attribution, author profiling is often modelled as a supervised problem (i.e., relying on text corpora labelled with demographics information). Systems of this kind have been applied to a wide range of tasks, most notably in gender and age classification (Kim et al. 2017; Takahashi et al. 2018; Rangel et al. 2020), but also in the recognition of personality traits (dos Santos, Ramos, and Paraboni 2019), bot detection (Pizarro 2019) and many others.
1.1 Authorship attribution using author profiling
Given the close relation between authorship attribution and author profiling – in the sense that both tasks are focused on gaining knowledge about the individual who wrote a piece of text – in the present work we shall argue that demographics predictions obtained from author profiling methods may help reduce the search space (i.e., the number of author candidates under consideration) in the authorship attribution task and, as a result, improve overall accuracy.
To illustrate this, let us assume, for instance, that we are able to infer the gender (male/female) of the individual who wrote a given piece of text by using a standard author profiling method as in, for example, Basile et al. (2017) or many others. In this case, gender predictions may effectively split the set of candidates under consideration into two groups (i.e., men and women), letting an authorship attribution model focus on the subset of candidate authors of interest. Moreover, as we shall argue in the present study, the same principle may apply not only to standard gender (or age) author profiling but also to many other (perhaps less usual) tasks, including the use of classifiers for education level, political orientation, degrees of religiosity or indeed for potentially any kind of demographics information that may be reliably inferred from labelled corpora using supervised machine learning.
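The search space reduction described above can be sketched as a simple filter over the candidate set. The candidate names, the attribute values and the hard filtering rule are hypothetical; the sketch assumes that the gender of every candidate is known and that the predicted gender is taken at face value, which is a simplification.

```python
def restrict_candidates(candidates, predicted_gender):
    """Keep only the candidates whose known gender matches the predicted one."""
    return [name for name, gender in candidates.items() if gender == predicted_gender]


# Hypothetical candidate set with known demographics.
candidates = {"ana": "female", "bruno": "male", "carla": "female", "diego": "male"}

# Suppose an author profiling classifier predicts "female" for the input document.
subset = restrict_candidates(candidates, "female")
```

In this example, the attribution model would only need to choose between two candidates instead of four. A hard filter of this kind would discard the true author whenever the profiling prediction is wrong, which is one reason why softer ways of combining the two tasks may be preferable.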
Using author profiling classifiers as an aid to the authorship attribution task might seem intuitive, and it has indeed been addressed in the context of aggressive language detection (Casavantes, López, and González 2019; Garrido-Espinosa, Rosales-Pérez, and López-Monroy 2020) and other tasks. This, however, gives rise to the question of how the two tasks may be combined, and whether using possibly suboptimal profiling estimators (as may often be the case) in this way may actually harm results. To shed light on these issues, the present work considers the authorship attribution model described in Custódio and Paraboni (2019), which consists of a stack of classifiers focused on different aspects of the input text (words, characters and text distortion patterns). An ensemble architecture along these lines – which obtained the overall best results in closed-set authorship attribution at the PAN-CLEF shared task (Kestemont et al. 2018) – may not only provide a suitable basis for an extension using multiple author profiling classifiers but, as we shall argue, may actually benefit even from suboptimal profiling estimators.
1.2 Goals and contributions
Based on these observations, this work describes a number of experiments using an authorship attribution model enriched with author profiling classifiers. In doing so, our goal is to verify whether the present method may improve results in a stack architecture as proposed in Custódio and Paraboni (2019) by considering a range of domains and languages and a number of standard and less-known author profiling tasks alike.
The main contributions of the present study are summarised below.
- A novel approach to closed-set authorship attribution that enriches an existing top-performing ensemble model with author profiling predictions.
- Evidence that the proposed approach compares favourably to previous work in the field for a number of domains, languages, candidate set sizes and tasks.
- Author profiling models that go beyond standard gender and age classification, including classifiers for education level, political orientation, degrees of religiosity and others.
The remainder of this article is structured as follows. Section 2 provides an overview of recent computational approaches to author profiling and authorship attribution methods alike. Section 3 describes a pilot experiment intended to illustrate how having access to author demographics information may improve results of the authorship attribution task in the intended stack architecture. Section 4 presents our extended approach to authorship attribution and the author profiling models under consideration. Section 5 describes the evaluation procedure, training and test data sets for our experiments. Section 6 presents results from both individual author profiling classifiers and the extended authorship attribution model. Finally, Section 7 presents final remarks and discusses future extensions.
2. Background
In this section, we briefly review existing work in the author profiling (Section 2.1) and authorship attribution (Section 2.2) fields. For further details, we also refer the reader to the results of the recent shared tasks devoted to each task in Rangel and Rosso (2019) and Kestemont et al. (2019), respectively.
2.1 Author profiling
Computational author profiling consists of inferring author demographics from text. Gender and age recognition are by far the most popular tasks of this kind found in the literature and are often addressed by using supervised machine learning methods. Author profiling has been the centre of a number of shared tasks in the PAN-CLEF series (Rangel and Rosso 2019), most notably focused on age and gender prediction in the Twitter domain, although other tasks (e.g., recognising personality traits, language variation, bots, etc.), languages (e.g., Arabic, Dutch, etc.) and modalities (e.g., learning from both images and texts) have been addressed as well.
As a brief introduction to recent approaches to author gender and age profiling, Table 1 summarises a number of selected studies in the field, including some of the top-performing systems at PAN-CLEF in 2017 (Basile et al. 2017), 2018 (Takahashi et al. 2018) and 2019 (Pizarro 2019).
We notice that most approaches are based on Twitter data, make use of word- and character n-gram models, and are often based on SVM or logistic regression classifiers. Further details are discussed as follows.
The early work in Nguyen et al. (2014) introduces a number of useful insights in gender and age prediction alike. The study compares machine and human performance in gender and age prediction from Twitter texts and discusses a number of the limitations of popular computational approaches to these tasks. The study points out differences between social and biological identities, and shows that, for over 10% of Twitter users, there is a mismatch between their biological sex and the kind of language they use on social media, and that older users tend to be perceived to be younger than they actually are. The study makes use of Dutch tweets translated to English and compares standard computational models (linear regression for age prediction, and logistic regression for gender prediction) with human evaluation. A majority-vote model obtains an accuracy of 0.84, which is similar to existing author profiling classifiers for English Twitter data.
The work in Basile et al. (2017) may be seen as a standard approach to author gender profiling, and it was the overall best-performing participant in the PAN-CLEF-2017 author profiling shared task (Rangel et al. 2017). The system obtained 0.83 average accuracy in author gender classification by making use of a linear SVM model with word unigrams and character n-gram counts (with n ranging from 3 to 5) as learning features. Other language- and domain-related features such as part-of-speech (POS) tags and Twitter handles were found to actually harm overall accuracy.
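To make this feature set concrete, the following sketch counts word unigrams and character n-grams of length 3 to 5 for a single text. It is an illustration only: whitespace tokenisation, lowercasing of words, the `w:`/`c:` key prefixes and the function name are assumptions, and the linear SVM that would consume these counts is not shown.

```python
from collections import Counter


def extract_features(text, n_min=3, n_max=5):
    """Count word unigrams and character n-grams for n in [n_min, n_max]."""
    features = Counter()
    # Word unigrams (assumption: lowercased, whitespace-separated tokens).
    for word in text.lower().split():
        features["w:" + word] += 1
    # Character n-grams taken over the raw text, spaces included.
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            features["c:" + text[i:i + n]] += 1
    return features


features = extract_features("so so good")
```

Each key in the resulting counter corresponds to one feature dimension, so that a collection of texts can be turned into a sparse count matrix for a linear classifier.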
In the work in Reddy et al. (2017), by contrast, the use of POS information plays a more prominent role in a gender classification task. The study introduces a TF-IDF (term frequency–inverse document frequency) weighted POS n-gram model that outperforms a number of standard baseline alternatives (e.g., bag of words, etc.) in the hotel reviews domain.
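A TF-IDF weighting over POS n-grams can be sketched as follows. The sketch assumes that documents have already been POS-tagged (the tags below are made up), uses POS bigrams, and applies one common TF-IDF variant (relative term frequency multiplied by the logarithm of the inverse document frequency); the exact weighting scheme of the cited study may differ.

```python
import math
from collections import Counter


def pos_ngrams(tags, n=2):
    """Return the list of POS n-grams (as strings) in a tag sequence."""
    return [" ".join(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def tfidf(documents, n=2):
    """Compute TF-IDF weights of POS n-grams for each tagged document."""
    counts = [Counter(pos_ngrams(tags, n)) for tags in documents]
    # Document frequency: number of documents in which each n-gram occurs.
    doc_freq = Counter(gram for c in counts for gram in c)
    total_docs = len(documents)
    weighted = []
    for c in counts:
        total = sum(c.values())
        weighted.append({gram: (k / total) * math.log(total_docs / doc_freq[gram])
                         for gram, k in c.items()})
    return weighted


# Two hypothetical, already tagged documents.
docs = [["DET", "NOUN", "VERB"], ["DET", "NOUN", "ADJ"]]
weights = tfidf(docs)
```

In this toy example, the bigram `DET NOUN` occurs in both documents and therefore receives zero weight, whereas the bigrams that distinguish each document receive a positive weight.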
Unlike most data-driven approaches to author profiling, in Isbister et al. (2017), author gender classification is addressed with the aid of psycholinguistic features computed from the Linguistic Inquiry and Word Count (LIWC) dictionary (Pennebaker, Francis, and Booth 2001). Results from SVM classifiers highlight the role of different LIWC categories in the task and differences across languages.
The work in Kim et al. (2017) addresses the issues of gender, age and user type Twitter profiling in the English language by classifying graph vertices with the aid of recursive neural networks (RNNs). To this end, network, text and label information are combined into tree structures and fed into individual RNNs. The approach is found to outperform a number of robust baseline systems (lexica, logistic regression, label propagation, text-associated DeepWalk and Tri-Party Deep Network Representations) in the three tasks under consideration.
The work in Takahashi et al. (2018) was the overall best-performing system in the PAN-CLEF 2018 author profiling shared task (Rangel et al. 2018), addressing the issue of gender classification based on multimodal input, that is, conveying both text and image data. To this end, a neural approach called ‘Text Image Fusion Neural Network’ (TIFNN) is introduced in order to leverage both data sources and produce gender predictions accordingly.
Finally, we notice that many of the early approaches to gender and age profiling have been recently outperformed by the work in Rangel et al. (2020), which enhances previous methods with the use of a novel text representation – called LDSE (Low-Dimensionality Statistical Embedding) – that takes into account the word distributions in each author profiling class. In the present work, however, since author profiling is viewed simply as a tool to improve our main task (i.e., authorship attribution), we shall focus on a simple approach to author profiling along the lines in Basile et al. (2017), Reddy et al. (2017) and others by making use of word-based models and logistic regression as discussed in Section 4.
2.2 Authorship attribution
Closed-set authorship attribution (hereafter called authorship attribution, for short) concerns the computational task of selecting the author of a given document from a well-defined set of candidates (Stamatatos 2017). As in the case of author profiling, authorship attribution often resorts to supervised machine learning methods, and it has been the focus of several shared tasks in the PAN-CLEF series (Kestemont et al. 2019), in addition to the related tasks of author clustering (Potthast et al. 2017), open-set authorship attribution (Kestemont et al. 2019) and others.
Table 2 summarises a number of recent studies in closed-set authorship attribution. The list, which is by no means complete, is solely intended to illustrate a variety of recent approaches to the task.
Generally speaking, existing approaches to authorship attribution are largely based on word- and character n-gram models, with some methods (Stamatatos 2017; Markov et al. 2017) resorting to text distortion (Granados et al. 2011) to omit certain parts of the input text whilst focusing on others. SVM classifiers are among the most popular strategies, and input size – which may be a particular concern in the Twitter domain – has been found to be correlated with overall accuracy (Rocha et al. 2017). Most recent studies are devoted to the English language, with the exception of those related to the PAN-CLEF authorship attribution task in Kestemont et al. (2018). Individual details are discussed as follows.
The study in Hinh et al. (2016) makes use of frame semantics from FrameNet (Baker, Fillmore, and Lowe 1998) to build a bag of frames representation intended to capture an author’s writing style. The model comprises features representing frame semantics statistics such as frame element (FE) counts, average number of FEs per frame and others, and it is compared against a baseline model comprising text-related features such as vocabulary size, word and character counts. Results from an SVM classifier show that the frame semantics authorship attribution model is consistently superior to the baseline in a corpus of adversarial stylometric data.
The work in Stamatatos (2017) makes use of a text distortion method inspired by Granados et al. (2011), in which rare words are replaced by sequences of a special symbol ‘*’, and more frequent words are kept unchanged. In doing so, authorship attribution SVM classifiers are able to focus on the text fragments that are deemed more relevant to the task. Results suggest, among other findings, that the method does improve overall accuracy and that function words are less suitable to text distortion.
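A word-level distortion of this kind can be sketched in a few lines. The sketch assumes that the set of frequent words is given (in practice it would be derived from a reference corpus), that words are matched with a simple regular expression, and that each masked word is replaced by as many ‘*’ symbols as it has characters; these choices and the function name are illustrative and not necessarily those of the cited work.

```python
import re


def distort_rare_words(text, frequent_words):
    """Mask every word that is not in frequent_words with a run of '*' symbols."""
    def mask(match):
        word = match.group(0)
        # Frequent words are kept unchanged; all others are masked.
        return word if word.lower() in frequent_words else "*" * len(word)
    return re.sub(r"\w+", mask, text)


# Hypothetical list of frequent words to be preserved.
frequent = {"the", "of", "and"}
distorted = distort_rare_words("The wizard of the north", frequent)
```

The resulting string is `The ****** of the *****`, in which topic-dependent words are hidden while the frequent words, word lengths and word positions are preserved.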
Text distortion is also performed at the pre-processing stage of input texts for authorship attribution in Markov et al. (2017). In this case, numbers, named entities and highly frequent words are replaced by special symbols. Results from SVM and multinomial Naive Bayes classification suggest that the method compares favourably to a standard bag-of-words approach without pre-processing.
The work in Shrestha et al. (2017) investigates a number of CNN architectures for authorship attribution in social media texts, taking as an input character unigram and bigram embeddings, and skip-gram word embedding representations. Results are compared against those obtained by a range of baseline systems, including the use of logistic regression with variable length character n-grams, and Long Short-Term Memory networks (LSTMs) with character bigrams.
The work in Rocha et al. (2017) focuses on the issues of closed- and open-set authorship attribution (of which only closed-set scenarios are presently dealt with) with limited input data using short texts from a corpus of 10 million tweets posted by 10,000 users (authors). The work makes use of SVM, Random Forest, distance-based and text compression methods built from word, character and POS n-grams. A number of experiments were carried out by varying both the number of candidate authors under consideration and the number of input texts per author. Among other findings, the study suggests that the text compression method outperforms the alternatives for small input sizes and that overall accuracy decreases linearly as the number of candidate authors increases.
The study in Sundararajan and Woodard (2018) investigates the role of syntax and word choice in authorship attribution, which may be particularly relevant to cross-genre scenarios in which content-based information does not play a significant role. Syntax is investigated with the aid of probabilistic context-free grammars (PCFGs) and Markov chain models, and the issue of word choice is addressed by masking out certain words or topics (which may be seen as an instance of text distortion) corresponding to different POS categories. Results suggest that cross-genre scenarios may benefit from syntactic knowledge, whereas both single- and cross-domain scenarios may benefit from lexical knowledge. Moreover, purely syntactic models were found to be insufficient by themselves, and may require combination with more content-oriented (e.g., character-based) models. In particular, common nouns, verbs, adjectives and adverbs were found to help author identification, whereas proper nouns do not.
The study in Stevenson et al. (2017) addresses the use of continuous word and character n-gram representations for authorship attribution in four domains using fastText (Joulin et al. 2017). In doing so, the model focuses on short word and character sequences, but it does not keep track of longer dependencies. Feature representations and classifiers are built jointly by adapting the fastText shallow architecture, and results suggest that the use of continuous character n-gram representations outperforms a number of baseline systems (and the use of continuous word n-grams) in two domains (news and reviews). On the other hand, the use of topic modelling was still superior in the case of legal texts authorship attribution.
The work in Patchala and Bhatnagar (2018) introduces an authorship attribution model based on topic-independent syntactic templates built from each candidate author of interest, and which are intended to represent an individual’s writing style. Results obtained from a number of standard classifiers (e.g., SVM, Naive Bayes and others) suggest that the combination of parsed tree structures and additional syntactic features outperforms the use of individual features alone.
The study in Reddy et al. (2018) introduces an instance-based authorship attribution method that relies on author-specific document weights to represent input texts, rather than document features or terms. Document weights are obtained by first computing term weights, which are subsequently normalised by author. A number of experiments – with and without document weighting – were carried out using standard classifiers (logistic regression, Naive Bayes and Random Forest) based on a bag of words model. Results from a small (10-author) reviews corpus suggest that document weighting generally increases task accuracy.
The work in Jafariakinabad and Hua (2019) presents a neural model that encodes document information from lexical, syntactic and structural levels for authorship attribution. In this approach, syntactic and lexical sentence representations are jointly encoded, and subsequently an attention-based hierarchical network encodes the syntactic and semantic structures of input texts themselves while rewarding those that help capture the writing style of their authors. The model is evaluated against a number of SVM and CNN baseline systems, including the approaches in Shrestha et al. (2017) and Stevenson et al. (2017). Results show the strength of each individual level of document information and suggest that the proposed model outperforms the baseline alternatives and its individual components alike.
Unlike existing machine learning approaches to authorship attribution, the work in Sharon Belvisi et al. (2020) takes a forensic approach to the task by comparing the use of standard n-gram and stylometric features (e.g., character, word and punctuation counts, etc.) through text similarity. More specifically, the evaluation of different features is carried out by measuring the similarity between representations of different authors using Cosine, Euclidean and Manhattan distances. Results based on a small (40-user) Twitter corpus suggest that the use of idiosyncratic features (e.g., misspellings, abbreviations, emoji counts, etc.) outperforms the use of n-gram counts by a small margin.
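For reference, the three distance measures mentioned above can be written as follows for two feature vectors of equal length. This is a plain sketch, not code from the cited study; it assumes non-zero vectors, and cosine distance is taken here to be one minus the cosine similarity.

```python
import math


def cosine_distance(u, v):
    """One minus the cosine of the angle between u and v (assumes non-zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def euclidean_distance(u, v):
    """Straight-line (L2) distance between u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan_distance(u, v):
    """Sum of absolute coordinate differences (L1 distance) between u and v."""
    return sum(abs(a - b) for a, b in zip(u, v))
```

For instance, the vectors `[3, 0]` and `[0, 4]` are orthogonal, so their cosine distance is 1, whereas their Euclidean and Manhattan distances are 5 and 7, respectively.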
Finally, the stack ensemble approach in Custódio and Paraboni (2019) will be taken as the starting point to the present work, and for that reason is discussed in more detail in the next section.
2.3 EACH-USP ensemble approach to authorship attribution
The EACH-USP approach to closed-set authorship attribution described in Custódio and Paraboni (2019) is based on the assumption that identifying the author of a given document may require relying on multiple knowledge sources. To this end, the approach makes use of standard word- and character-based n-gram models, and an additional character-based model subject to text distortion (Granados et al. 2011). The output probabilities of the three models – hereafter called word, char and distorted char – are combined in a stack architecture (Wolpert 1992) and subject to a second-level logistic regression classifier to determine the author of an input document. This architecture is illustrated in Figure 1.
Text distortion was introduced in Granados et al. (2011) and has been previously considered in authorship attribution (Stamatatos 2017), deception detection (Sánchez-Junquera et al. 2020) and other tasks, and it is largely intended to mask out words that are not relevant to the task. In Custódio and Paraboni (2019), by contrast, text distortion is performed at the character level. More specifically, the model replaces every character in the input text – except punctuation and diacritics – with a ‘*’ symbol so that the model is able to focus on these particular patterns. An example of text distortion of this kind – rendered in Portuguese to show diacritics usage – is illustrated in Table 3.
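A character-level distortion along these lines can be sketched with the standard `unicodedata` module. The sketch makes several assumptions that are not stated in the text: whitespace is preserved, punctuation is identified by its Unicode general category, and a character counts as a diacritic if its canonical decomposition contains a combining mark. The function names and the example sentence are illustrative.

```python
import unicodedata


def has_diacritic(ch):
    """True if the canonical decomposition of ch contains a combining mark."""
    return any(unicodedata.combining(c) for c in unicodedata.normalize("NFD", ch))


def distort_characters(text):
    """Replace every character with '*', except whitespace, punctuation and diacritics."""
    out = []
    for ch in unicodedata.normalize("NFC", text):
        keep = (ch.isspace()                                     # assumption: keep spacing
                or unicodedata.category(ch).startswith("P")      # punctuation
                or has_diacritic(ch))                            # accented letters
        out.append(ch if keep else "*")
    return "".join(out)


distorted = distort_characters("Não, é você?")
```

For the Portuguese example above, the result is `*ã*, é ***ê?`, which keeps only the punctuation and accentuation patterns of the original sentence. Under these assumptions, digits and symbols such as currency signs are masked as well.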
The attention to punctuation and diacritics patterns has been found to be particularly useful for more general, cross-domain authorship attribution tasks in multiple languages, as in Kestemont et al. (2018). As for the combination of the three individual classifier components, this works as follows. First, the set of d input documents is vectorised by making use of a word-, char- or distortion-based feature extraction function V(d) as required by each model (or channel), and the resulting feature set X is normalised by a function N(X). Next, X is subject to PCA dimensionality reduction, and a multinomial classifier generates the probability $P(Y=k)$ for each class k.
The outputs of the first-level classifiers are combined by making use of a second-level model $\sigma\left(\sum_c \sum_i w_{ci} \, c_i + k\right)$, where $c_i$ is the probability of a candidate author i being the actual author of the given document according to the classifier c, $w_{ci}$ is the weight of $c_i$, k is a constant and $\sigma$ is the sigmoid function. This produces a new vector s containing, for each candidate author (or class) i, the probability of that candidate being the author of the document.
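The second-level combination can be illustrated with a small numeric sketch. It assumes that one weight vector and one constant are used per candidate author (a one-vs-rest reading of the formula above), and the weights below are hand-picked for illustration rather than learned from data; all names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def stack_scores(channel_probs, weights, constants):
    """Combine first-level probabilities into one second-level score per candidate.

    channel_probs: one list of per-candidate probabilities for each channel.
    weights[j]: weights for candidate j, aligned with the concatenated channel outputs.
    constants[j]: the constant term k for candidate j.
    """
    features = [p for probs in channel_probs for p in probs]
    return [sigmoid(sum(w * x for w, x in zip(w_j, features)) + k_j)
            for w_j, k_j in zip(weights, constants)]


# Two candidates, three channels (word, char, distorted char); values are made up.
channel_probs = [[0.6, 0.4], [0.7, 0.3], [0.5, 0.5]]
weights = [[1, 0, 1, 0, 1, 0],   # candidate 0 attends to its own probabilities
           [0, 1, 0, 1, 0, 1]]   # candidate 1 likewise
constants = [-1.5, -1.5]
scores = stack_scores(channel_probs, weights, constants)
predicted = max(range(len(scores)), key=scores.__getitem__)
```

In this example, the first candidate receives the higher combined score and is selected. In a trained stack, the weights and constants would be estimated from the first-level outputs on training data.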
This stack ensemble approach was evaluated at the PAN-CLEF authorship attribution shared task in Kestemont et al. (2018), which considered cross-domain authorship attribution scenarios based on fan fiction texts written in five languages, and required identifying the author of a text written in a particular genre (e.g., Harry Potter) based on texts written in a different genre (e.g., Star Wars). Given the top-performing results reported in Kestemont et al. (2018), and the observation that a stack architecture of this kind may be easily extended with any number of additional (e.g., author profiling) classifiers, this approach will be taken as the basis to the present work as well.
3. Pilot study: does author profiling information help authorship attribution?
Before introducing our current profiling-based approach to authorship attribution, we will first examine the extent to which having access to author demographics information may actually help author identification. To this end, we envisaged a simple pilot study in which author gender information available from a labelled corpus is fed directly into the authorship attribution ensemble in Custódio and Paraboni (2019). This strategy, which amounts to using ground truth information instead of predictions made by an author gender classifier, is intended to illustrate whether using gender information may help authorship attribution at all and, if so, what upper and lower limits of accuracy an actual gender classifier would be expected to achieve in order to effectively help authorship attribution. After discussing these issues, the use of actual classifiers will be the focus of our main approach in Section 4.
3.1 Overview
The present analysis makes use of the b5-corpus of Facebook texts described in Ramos et al. (2018), which will be further discussed in Section 4.1 as part of our main author profiling experiments. In this experiment, binary gender labels (male/female) available from the corpus are added as a fourth information source to the authorship attribution ensemble in Custódio and Paraboni (2019), that is, in addition to the word, character and text distortion channels described in Section 2.3. The resulting ensemble is illustrated in Figure 1 and essentially differs from the original architecture only by presenting a fourth (blue) channel at the bottom, which is intended to represent the gender label information taken from the input texts.
The use of binary gender information in this way is similar to the ap.label approach to be discussed in Section 4.2. In its present form, 0/1 gender labels are combined with the probabilities obtained by the three ensemble components and taken as the input to the second-level authorship attribution stack classifier.
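The input to the second-level classifier in this setting can be pictured as a simple concatenation. The sketch below is an illustration under assumptions: the function name, the example probabilities and the way the 0/1 label is appended as a single extra value are hypothetical, and the mapping between 0/1 and male/female is arbitrary.

```python
def build_meta_features(word_probs, char_probs, distorted_probs, gender_label):
    """Concatenate the probabilities of the three channels with a 0/1 gender label."""
    if gender_label not in (0, 1):
        raise ValueError("gender_label must be 0 or 1")
    return (list(word_probs) + list(char_probs) + list(distorted_probs)
            + [float(gender_label)])


# Hypothetical first-level outputs for a two-candidate problem.
meta = build_meta_features([0.6, 0.4], [0.7, 0.3], [0.5, 0.5], gender_label=1)
```

Here the second-level classifier would receive seven input values per document: six probabilities from the three channels and one gender indicator.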
Using gender labels available from the input text will provide us with an upper limit for the accuracy that the ensemble authorship attribution model may be able to achieve when using an optimal author profiling classifier. In practice, however, author profiling classifiers will most likely obtain much lower results. To shed light on this issue, we ran a number of simulations in which different levels of noise were added to the model, so that the actual gender information was corrupted by a certain margin. By comparing multiple authorship attribution scenarios based on gender estimates of varying degrees of robustness, we would like to establish the lower limit of accuracy that a gender classifier would be expected to achieve in this particular setting.
3.2 Procedure
The 50 authors with the largest amount of text available from the corpus were selected for this analysis, and their texts were split into document units (or posts) at line breaks. The study consisted of comparing authorship attribution results obtained by the standard EACH-USP approach in Custódio and Paraboni (2019), which is presently taken as a baseline system, and its extended version that includes gender information with a certain level of added noise. Assuming that gender labels available from the corpus are 100% correct, we tested a number of scenarios in which gender information was corrupted so as to obtain 90%, 80%, 70%, 60% and 50% accuracy, hence simulating author profiling classifiers of different levels of robustness.
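One way of simulating a gender classifier with a given accuracy is to flip a fixed proportion of the gold labels at random. The sketch below follows this idea; the text does not specify how the noise was actually injected, so the flipping strategy, the per-label granularity, the fixed random seed and the function name are all assumptions.

```python
import random


def corrupt_labels(labels, target_accuracy, seed=0):
    """Flip exactly round((1 - target_accuracy) * n) binary labels chosen at random."""
    rng = random.Random(seed)
    n_flip = round((1.0 - target_accuracy) * len(labels))
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flip_idx else y for i, y in enumerate(labels)]


# Hypothetical gold gender labels for 100 documents.
true_labels = [0, 1] * 50
noisy = corrupt_labels(true_labels, target_accuracy=0.7)
agreement = sum(a == b for a, b in zip(true_labels, noisy)) / len(true_labels)
```

Because the number of flipped labels is fixed rather than sampled, the agreement between the corrupted and the gold labels matches the target accuracy exactly (0.7 in this example).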
Testing was carried out as follows. First, two authors are randomly selected and taken as the input to both models (with and without gender information.) Next, additional authors are randomly selected one at a time, and the procedure is repeated until reaching 20 authors. For the largest (i.e., 20-author) setting, this corresponds to 5600 train and 2400 test documents. At each turn, we compute accuracy, precision, recall and F1 measures. In order to minimise possible effects of random selection, the experiment is repeated 20 times, and we report its overall mean results.
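A rough sketch of this incremental procedure is given below. The helper names and the `evaluate` callback (which stands for training and scoring a model on a given candidate set) are hypothetical, and only a single score per candidate set is averaged here for brevity.

```python
import random
from statistics import mean

def incremental_candidate_sets(authors, max_size=20, seed=0):
    """Start from two random authors and add one random author at a time."""
    rng = random.Random(seed)
    order = rng.sample(authors, max_size)
    return [order[:k] for k in range(2, max_size + 1)]

def mean_scores(authors, evaluate, runs=20, max_size=20):
    """Repeat the procedure `runs` times and average per candidate set size."""
    per_size = {}
    for run in range(runs):
        for candidates in incremental_candidate_sets(authors, max_size, seed=run):
            per_size.setdefault(len(candidates), []).append(evaluate(candidates))
    return {size: mean(scores) for size, scores in per_size.items()}

# Toy usage: a dummy evaluator that returns the chance-level accuracy.
authors = [f"author_{i}" for i in range(20)]
results = mean_scores(authors, evaluate=lambda cands: 1.0 / len(cands))
print(results[2], results[20])
```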
3.3 Results
Table 4 summarises mean results for the EACH-USP baseline method and the alternatives that have access to additional gender information with different degrees of accuracy, ranging from 100% (hence simulating an optimal gender classifier) to 50%.
From these results, a number of observations are warranted. First, we notice that using an optimal gender classifier (as in the 100% column) would indeed help authorship attribution by a considerable margin, that is, overall accuracy would be increased by 9 points (from 0.33 to 0.42) in this particular scenario. This represents the upper (and in practical terms possibly unachievable) limit for a method based on gender author profiling classifiers. Second, we notice that using a suboptimal gender classifier would still be helpful (i.e., outperforming the use of the baseline ensemble alone) if the classifier accuracy is above 70%, which is therefore the lower limit for a gender classifier in this scenario.
Finally, we also notice that, for both approaches, accuracy decreases uniformly as the number of candidate authors (or classes to be learned) is increased. This effect, which is to be expected in a multi-class machine learning setting of this kind, is illustrated in Figure 3, and it is consistent with the findings in Rocha et al. (Reference Rocha, Scheirer, Forstall, Cavalcante, Theophilo, Shen, Carvalho and Stamatatos2017), in which experiments in authorship attribution involving multiple candidate set sizes have been discussed at length.
Put together, these results have motivated us to implement a range of author profiling models (which are by definition suboptimal) and use their predictions in an ensemble approach to authorship attribution along the lines of the present pilot study.
4. Authorship attribution using author profiling classifiers
As discussed in the previous section, the use of ground truth author demographics obtained from corpus labels to aid the authorship attribution task suggests that author profiling methods may obtain comparable results in an automatic fashion, that is, without resorting to corpus annotation directly. To put this idea to the test, in this section, we introduce a stack ensemble method that extends the approach in Custódio and Paraboni (Reference Custódio and Paraboni2019) with a number of independently-built author profiling classifiers as an aid to authorship attribution in different domains and languages.
Unlike the experiment in the previous section, however, the present approach will not rely on the actual demographics about the authors to be identified, using instead predictions made by multiple models built from a disjoint data set (i.e., which does not include any author under identification.) In other words, author profiling and authorship attribution models are independently built from different data and, despite the use of supervised author profiling methods, the present authorship attribution approach does not require the input documents to be labelled with author demographics, taking as an input only a standard set of documents labelled with unique identifiers as in the existing work in the field.
The remainder of this section will focus on the use of one author profiling classifier at a time, leaving the discussion on how to combine multiple classifiers in a single task to be dealt with in Section 6.4.
4.1 Data
Using author profiling classifiers to aid authorship attribution requires text documents labelled with both author demographics information (in order to train the author profiling classifiers) and unique author identifiers (to train the authorship attribution model proper.) This unfortunately rules out many of the existing corpora available for the purpose of authorship attribution, including those made available by the PAN-CLEF shared tasks (Kestemont et al. Reference Kestemont, Stamatatos, Manjavacas, Daelemans, Potthast and Stein2019) since those corpora are generally labelled only with author identifiers, but not with author demographics. Corpora developed for author profiling tasks, on the other hand, will obviously provide demographics information, but author identifiers are often unavailable.
Based on these observations, we selected a number of publicly available corpora in different domains and languages, and whose text documents are suitably labelled for both author profiling and authorship attribution tasks as required by our combined approach. More specifically, our models will be built from text in four domains: blog texts from the Blog Authorship corpus (Schler et al. Reference Schler, Koppel, Argamon and Pennebaker2006), Facebook posts from the b5-post corpus (Ramos et al. Reference Ramos, Neto, Silva, Monteiro, Paraboni and Dias2018), short essay texts about topics of a moral nature (e.g., abortion legalisation, death penalty, etc.) from the BRmoral corpus (dos Santos and Paraboni Reference dos Santos and Paraboni2019; Pavan et al. Reference Pavan, dos Santos, Lan, ao Trevisan Martins, dos Santos, Deutsch, da Costa, Hsieh and Paraboni2020) and Twitter data from the TwiSty corpus (Verhoeven, Daelemans, and Plank Reference Verhoeven, Daelemans and Plank2016). We notice that, in addition to providing a certain level of variety to our experiments (and hence reducing possible effects of topical bias and others, as discussed in Sari, Stevenson, and Vlachos Reference Sari, Stevenson and Vlachos2018), some of these choices were motivated by our particular interest in Portuguese NLP, or were aimed at investigating the use of author profiling tasks beyond standard gender and age classification.
All data sets are labelled with unique author identifiers. Blogs, Facebook and essay texts are labelled so as to support multiple author profiling tasks in one single language each, whereas Twitter texts are labelled only with author gender information available in six languages (and which may therefore be regarded as six independent corpora.) Descriptive statistics for each corpus are summarised in Table 5 and further discussed below.
Possible author profiling tasks are determined by the labels available from each corpus. All corpora are labelled with author binary gender (G) information, and therefore support binary (male/female) gender classification. With the exception of the Twitter domain, all corpora are also labelled with age (A) information, which has been presently modelled as a 3-class problem based on the distribution of each corpus. IT background is a binary label available from the b5-post (Facebook) and BRmoral (Essay) corpora only, both of which contain a significant proportion of text produced by students in Computer Science and related fields, and it indicates whether each author in the corpus has this kind of background or not. Level of Education (E), political orientation (P) and degree of religiosity (R) are crowd-sourced, self-reported labels available in the essay domain represented by the BRmoral corpus. Each of these labels supports a ternary classification problem (from basic to superior education, from left to right political orientation and from not religious at all to highly religious.) For details regarding the BRmoral corpus and its annotation scheme, we refer to Pavan et al. (Reference Pavan, dos Santos, Lan, ao Trevisan Martins, dos Santos, Deutsch, da Costa, Hsieh and Paraboni2020).
4.2 Author profiling and authorship attribution models
Since, in the present work, author profiling classifiers have been developed only as a support to the main task of authorship attribution, in what follows we take a standard approach to the task by making use of TF-IDF unigram counts and multinomial logistic regression with univariate feature selection along the lines of Hsieh, Dias, and Paraboni (Reference Hsieh, Dias and Paraboni2018) and others. In all models, logistic regression uses L2 regularisation and the newton-cg solver with a 0.0001 tolerance as a stopping criterion. For reasons discussed below, depending on the authorship attribution strategy under consideration, author profiling predictions obtained by performing logistic regression may be taken either as class probabilities or as actual class labels.
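For illustration, a classifier of this kind might be assembled as in the sketch below. The use of scikit-learn is an assumption (the text names the solver and tolerance but not the library), as are the function name `build_profiling_model`, the toy texts and the value of `k_best`; the ANOVA F-score (`f_classif`) is the selection function mentioned in Section 5.2.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def build_profiling_model(k_best):
    """TF-IDF unigrams -> univariate k-best selection -> logistic regression."""
    return make_pipeline(
        TfidfVectorizer(ngram_range=(1, 1)),
        SelectKBest(f_classif, k=k_best),
        # L2 is scikit-learn's default penalty, so it is not passed explicitly.
        LogisticRegression(solver="newton-cg", tol=1e-4),
    )

# Toy data only; each text stands for all posts of one author, concatenated.
texts = [
    "football beer match tonight",
    "football goal",
    "dress shopping makeup sale",
    "shopping party",
]
labels = ["m", "m", "f", "f"]

model = build_profiling_model(k_best=4).fit(texts, labels)
probs = model.predict_proba(["football tonight"])  # class probabilities
label = model.predict(["football tonight"])        # hard class label
print(model.classes_, probs, label)
```

The same fitted model yields both kinds of output discussed in the text: `predict_proba` gives class probabilities and `predict` gives the single most likely class.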
Regarding the authorship attribution task proper, the present work essentially extends the EACH-USP stack authorship attribution approach in Custódio and Paraboni (Reference Custódio and Paraboni2019) by adding author profiling classifiers to the existing ensemble of word, character and distorted character models as discussed in Section 2. In other words, the actual architecture is similar to the previous Figure 2, except that author demographics information will be presently inferred from text automatically with the aid of author profiling classifiers, rather than taken from ground truth corpus labels. This is illustrated in Figure 4 using two author profiling modules as an example (in light colour, at the bottom) and further discussed below.
Two strategies for adding author profiling predictions to the ensemble, hereby called ap.prob and ap.label, are presently considered. These strategies differ from each other only in the way their output predictions are represented. In ap.prob, we use author profiling predictions represented as probabilities not unlike the output of any of the existing components of the stack ensemble model. In ap.label, by contrast, we use class label predictions (e.g., for gender, age etc.) In doing so, we would like to investigate the extent to which the present authorship attribution tasks may benefit from having access to more fine-grained probabilities or more coarse-grained class label predictions.
The use of author profiling probabilities from a given input document in ap.prob is illustrated as follows. Let us consider, for instance, a gender classifier that predicts that the author of the document has a 0.43 probability of being male, and hence a 0.57 probability of being female. In a binary classification task of this kind, both probabilities are taken as an input to the second-level classifier (that is, in addition to the existing probabilities predicted by the original word, char and distorted char classifiers.) Similarly, for ternary author profiling classes (e.g., education level etc.), the three probabilities are considered.
An example of how the input to the second-level classifier is represented in ap.prob is illustrated in Table 6, in which probabilities provided by the Word, Char (character) and text distortion (Dist) modules of the EACH-USP ensemble for an individual candidate author are appended to his/her gender probabilities. This creates a set of five probabilities associated with each author, which are to be submitted to the second-level authorship attribution classifier.
Regarding the use of class label predictions in ap.label, author profiling probabilities are replaced by class labels directly or, more specifically, by assigning the value 1 to the class of highest probability and 0 to all the others. Thus, for instance, the 0.57 probability of being female in the previous example would be replaced by a 1 value, and the 0.43 probability of being male would be replaced by 0. An example of this representation using a ternary class (Education e1–e3) for an individual author is illustrated in Table 7, in which the class of highest probability is assumed to be e2.
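The two representations can be contrasted with a small sketch. The function names and the Word/Char/Dist values are hypothetical; only the 0.43/0.57 gender probabilities come from the example above.

```python
def ap_prob(base_probs, profile_probs):
    """Append the author profiling probabilities unchanged."""
    return list(base_probs) + list(profile_probs)

def ap_label(base_probs, profile_probs):
    """Append a 0/1 vector with 1 at the class of highest probability."""
    best = max(range(len(profile_probs)), key=profile_probs.__getitem__)
    one_hot = [1 if i == best else 0 for i in range(len(profile_probs))]
    return list(base_probs) + one_hot

base = [0.20, 0.35, 0.30]  # hypothetical Word, Char and Dist probabilities
gender = [0.43, 0.57]      # P(male), P(female) from the example in the text

print(ap_prob(base, gender))   # [0.2, 0.35, 0.3, 0.43, 0.57]
print(ap_label(base, gender))  # [0.2, 0.35, 0.3, 0, 1]
```

With a ternary class whose second value is the most probable, `ap_label` would append `[0, 1, 0]` in the same way.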
5. Evaluation
This section describes the evaluation of our present approach. First, we discuss how the corpora described in the previous sections were organised into non-overlapping training and test sets for the author profiling and authorship attribution tasks. Next, we describe the evaluation procedure proper, and details of how each of the two tasks was optimised and tested. In the case of the authorship attribution task, statistical significance is assessed using McNemar’s test (McNemar Reference McNemar1947).
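As a reminder of how McNemar's test operates on paired predictions, a minimal sketch follows. It uses the exact (binomial) two-sided variant, which is an assumption, since the text does not say which variant was applied; the function name and the toy outcome vectors are hypothetical.

```python
from math import comb

def mcnemar_exact(baseline_correct, model_correct):
    """Two-sided exact McNemar test on paired per-document outcomes.

    Only discordant pairs matter: b counts documents that the baseline
    gets right and the model gets wrong, c counts the opposite case.
    """
    pairs = list(zip(baseline_correct, model_correct))
    b = sum(1 for x, y in pairs if x and not y)
    c = sum(1 for x, y in pairs if y and not x)
    n = b + c
    if n == 0:
        return 1.0
    # Under the null hypothesis, b follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Toy outcomes: the model fixes 10 baseline errors and breaks nothing.
baseline = [True] * 5 + [False] * 10
model = [True] * 15
print(mcnemar_exact(baseline, model))
```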
5.1 Train and test sets
Central to the current approach is the separation between data for our main task – authorship attribution – and for the auxiliary author profiling classifiers. Authorship attribution data consist of a set of train and test documents produced by 20 selected authors as discussed below, and are labelled only with author identifiers. Author profiling data, by contrast, comprise all documents produced by other authors (i.e., those outside the 20-author group), and are labelled with author demographics only (e.g., gender, age, etc.)
The organisation of each corpus into training and test sets takes into account the differences in granularity of the author profiling and authorship attribution tasks. The present author profiling models take as an input the set of all texts produced by an individual, which are concatenated as a single document labelled with their corresponding demographics. The authorship attribution models, by contrast, require multiple text samples from each candidate author (or else author identification would become trivial) and, as a result, take as an input individual text units (i.e., Facebook and blog posts, sentences or tweets) as described in Section 4.1.
As the main focus of the present work is the authorship attribution task, we selected from each corpus the 20 authors with the largest volume of text available. These sets of 20 authors are taken to be the test data of our main authorship attribution approach in each domain and will be considered in a number of experiments in multiple test scenarios conveying from 2 to 20 candidates each. The choice for the authors with the largest possible amount of text data is intended to minimise situations in which the baseline approach in Custódio and Paraboni (Reference Custódio and Paraboni2019) may fail due to lack of data, which would have obscured the role of the author profiling classifiers as an aid to the authorship attribution task. The issue of input size in authorship attribution is addressed in detail in Rocha et al. (Reference Rocha, Scheirer, Forstall, Cavalcante, Theophilo, Shen, Carvalho and Stamatatos2017). Test set author profiling class distributions are illustrated in Table 8. We notice, however, that these class labels are not taken into account by the present approach and are presented only as a means to illustrate how author profiling estimates may help authorship attribution.
Test sets selected from each domain are naturally skewed towards some classes and, as the comparison among author profiling classifiers (e.g., gender, age, etc.) will require a fixed test set for each domain, class imbalance may impact the results of the present authorship attribution approach. For instance, we notice that using a gender classifier is arguably less helpful if most test authors turn out to be of the same gender. Keeping balanced test sets for all author profiling classes is, however, impractical for a number of reasons. First, we notice that this would require a large number of distinct author profiles to cover all possible class values (e.g., the Essay domain would require a test set consisting of at least 324 distinct authors selected out of a corpus containing only 510 individuals). Moreover, many profiles are considerably rare, or simply do not occur at all in the data (e.g., there are relatively few individuals who belong to the more extreme classes of education, politics and religiosity; most IT people tend to be male, etc.)
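The figure of 324 profiles follows from multiplying the number of classes of the six Essay labels described in Section 4.1, as the short check below shows (the dictionary keys are merely illustrative names).

```python
from math import prod

# Number of classes per author profiling label in the Essay domain.
class_counts = {
    "gender": 2,
    "age": 3,
    "it_background": 2,
    "education": 3,
    "politics": 3,
    "religiosity": 3,
}

n_profiles = prod(class_counts.values())
print(n_profiles)  # 324
```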
In order to minimise these difficulties, in the evaluation of our authorship attribution approach we will keep the natural author profiling class imbalance as is, and we will resort instead to multiple random tests as discussed in Section 5.3. The issue of class imbalance will also be revisited in the light of our results as discussed in Section 6.3.
Finally, leaving the 20 test authors aside, the remaining portion of each corpus is concatenated (i.e., disregarding author identifiers) and taken as training data for the auxiliary author profiling classifiers in each domain. Thus, the training data for author profiling does not include any text produced by the members of the 20-author group and, conversely, authorship attribution does not rely on the actual demographics information associated with the authors under identification, making instead its own predictions based on a disjoint data set.
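The separation between the two data sets may be sketched as follows. The record layout `(author_id, text, demographics)`, the function name and the use of character counts as a measure of text volume are assumptions made for the sake of the example.

```python
from collections import defaultdict

def split_by_author(docs, n_test_authors=20):
    """Split (author_id, text, demographics) records into disjoint sets.

    Returns (attribution, profiling): attribution keeps individual texts
    of the most prolific authors, labelled with author identifiers only;
    profiling holds one concatenated document per remaining author,
    labelled with demographics only.
    """
    volume = defaultdict(int)
    for author, text, _ in docs:
        volume[author] += len(text)
    top = set(sorted(volume, key=volume.get, reverse=True)[:n_test_authors])

    attribution = [(a, t) for a, t, _ in docs if a in top]

    merged, demographics = defaultdict(list), {}
    for author, text, demo in docs:
        if author not in top:
            merged[author].append(text)
            demographics[author] = demo
    profiling = [(" ".join(ts), demographics[a]) for a, ts in merged.items()]
    return attribution, profiling

docs = [
    ("a1", "a fairly long post", {"gender": "f"}),
    ("a1", "another long post", {"gender": "f"}),
    ("a2", "short", {"gender": "m"}),
]
attribution, profiling = split_by_author(docs, n_test_authors=1)
print(attribution)
print(profiling)
```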
The number of train and test documents (i.e., text units, cf. previous Table 5) for each task are summarised in Table 9, based on the largest possible (i.e., 20-author) evaluation setting.
5.2 Author profiling evaluation
Prior to the evaluation of the present authorship attribution approach, we built and evaluated its individual components, that is, the author profiling classifiers that could be built from the existing labels in each corpus. To this end, we performed univariate feature selection over development data using the ANOVA function and the F1 metrics to obtain the k-best text features (i.e., words) in each domain and language. Optimal values were searched within the 3000–20,000 features range at 1000 intervals and are summarised in Table 10.
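The search grid itself is small, as the following sketch shows. The function names are hypothetical, and `score_for_k` stands for whatever development-set score (for instance, F1) is obtained when a model is trained on the k best features.

```python
def candidate_k_values(low=3000, high=20000, step=1000):
    """All k values from 3000 to 20,000 inclusive, in steps of 1000."""
    return list(range(low, high + 1, step))

def best_k(score_for_k, ks=None):
    """Return the k with the highest score according to `score_for_k`."""
    ks = candidate_k_values() if ks is None else ks
    return max(ks, key=score_for_k)

print(len(candidate_k_values()))  # 18

# Toy scoring function that peaks at k = 7000.
print(best_k(lambda k: -abs(k - 7000)))  # 7000
```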
Evaluation of the author profiling models was carried out by performing 10-fold cross validation over the training data and by considering a simple majority class baseline for illustration purposes. For all author profiling classifiers, we measure mean precision, recall, F1 and accuracy scores.
5.3 Authorship attribution evaluation
Multiple authorship attribution evaluation experiments were carried out by considering random sets of candidate authors drawn from the 20-author test set. With the exception of the Blog domain, tests were carried out by varying the number of candidate authors from 2 to 20. In the case of blogs, only tests involving 5, 10, 15 and 20 candidates were considered due to computational costs. As a means to obtain a balanced (authorship attribution) classification setting, the number of input texts taken from each candidate author is kept constant within each task by considering the smallest set size of the group.
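A possible way of equalising the number of texts per candidate is shown below. Random downsampling and the name `balance_documents` are assumptions; the text only states that the smallest set size of the group is used.

```python
import random

def balance_documents(docs_by_author, seed=0):
    """Downsample every candidate to the size of the smallest candidate."""
    rng = random.Random(seed)
    n = min(len(docs) for docs in docs_by_author.values())
    return {a: rng.sample(docs, n) for a, docs in docs_by_author.items()}

candidates = {
    "author_a": [f"a{i}" for i in range(5)],
    "author_b": [f"b{i}" for i in range(3)],
    "author_c": [f"c{i}" for i in range(8)],
}
balanced = balance_documents(candidates)
print({a: len(docs) for a, docs in balanced.items()})
```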
In order to reduce the possible effects of random selection (e.g., in case of author profiling class imbalance when most authors turn out to belong to the same gender, etc.), evaluation was repeated 20 times by varying the candidates randomly, and also by randomly selecting different train and test documents. More specifically, we performed 20 runs * (3 corpora * 19 non-blog candidate set sizes) + 20 runs * (1 corpus * 4 blog candidate set sizes) experiments, making 1220 randomised authorship attribution evaluation tasks in total. Given the large number of evaluation scenarios, in what follows we will report overall mean results only.
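The total number of evaluation tasks can be checked with a few lines of arithmetic (variable names are illustrative).

```python
runs = 20
non_blog_sizes = len(range(2, 21))   # candidate sets of 2..20 authors -> 19
blog_sizes = len((5, 10, 15, 20))    # blog candidate set sizes -> 4

total = runs * (3 * non_blog_sizes) + runs * (1 * blog_sizes)
print(total)  # 1220
```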
The two versions of our current approach – ap.prob and ap.label – are to be compared against the standard EACH-USP ensemble baseline system in Custódio and Paraboni (Reference Custódio and Paraboni2019) whilst measuring mean accuracy scores for all models. In doing so, our goal is to verify whether using author profiling classifiers improves results over the original approach that does not have access to author demographics predictions.
6. Results
Results of the experiments described in the previous section are presented in two parts. Section 6.1 reports results for the individual author profiling classifiers, and Section 6.2 presents results for the authorship attribution task proper.
6.1 Author profiling results
Table 11 presents author profiling results for the four domains under consideration (Facebook, Essay, Blogs and Twitter, respectively) as obtained from test data using our current classifiers and a majority class baseline system. Best macro F1 scores for each class are highlighted.
Although not the main focus of the present work, this admittedly simple analysis should suffice to illustrate that the present author profiling classifiers obtain results considerably above a majority class selector and that this approach may arguably help improve results in the actual authorship attribution task in the same way that using ground truth gender information improved results in the pilot study described in Section 3.
6.2 Authorship attribution results
This section presents mean accuracy results obtained by our present authorship attribution approaches ap.prob and ap.label and by the baseline system. For each individual domain and task, best results are highlighted.
6.2.1 Blog domain
Table 12 presents mean accuracy scores for authorship attribution in the Blog domain as obtained by the EACH-USP baseline and by the ap.prob and ap.label models using age and gender classifiers. Best results for each candidate set (conveying 5, 10, 15 or 20 authors each) are highlighted. All differences between the baseline and the proposed models are significant ( $ p < 0.0001 $ ).
Results for the Blog domain suggest that, on average, using author profiling classifiers is superior to the standard authorship attribution method by a narrow but significant margin. In particular, the use of (ternary) age labels as predicted by ap.label outperforms the baseline and, to a lesser extent, it is also superior to the use of age probabilities as predicted by ap.prob. The use of gender classifiers is still useful if compared to the baseline, but the advantage is small, and mean results obtained by both ap.label and ap.prob are similar.
6.2.2 Facebook domain
Table 13 presents mean accuracy scores for authorship attribution in the Facebook domain as obtained by the EACH-USP baseline and by the ap.prob and ap.label models using age, gender and IT background classifiers. Best results for each candidate set (conveying from 2 to 20 authors each) are highlighted. All differences between the baseline and the proposed models are significant ( $ p < 0.0001 $ ).
The use of author profiling classifiers is consistently superior to the standard authorship attribution model alone and, in particular, using label predictions provided by the (ternary) age classifier is best of all. On the other hand, predictions made by the binary classifiers (gender and IT background) are less helpful, and the difference between ap.prob and ap.label is generally small.
6.2.3 Essay domain
Table 14 presents mean accuracy scores for authorship attribution in the Essay domain as obtained by the EACH-USP baseline and by the ap.prob and ap.label models (abbreviated to prob and label for ease of visualisation) using age, gender, IT background, political orientation, degree of religiosity and education level classifiers. Best results for each candidate set (conveying from 2 to 20 authors each) are highlighted. All differences between the baseline and the proposed models are significant ( $ p < 0.0001 $ ) except for the comparison between the baseline and ap.prob using the religiosity classifier.
Results for the Essay domain suggest that all author profiling classifiers help author identification by a sizeable margin. The single most successful strategy in this domain is the use of degrees of religiosity as predicted by ap.label. All ternary classifiers (religiosity, age, politics and education) outperform the binary classifiers for gender and IT background by a small margin, although binary classifiers are still significantly helpful if compared to the baseline. Moreover, we notice that the use of class labels as predicted by ap.label consistently outperforms both the baseline system and the use of probabilities in ap.prob in all scenarios.
6.2.4 Twitter domain
Authorship attribution results for the six Twitter data sets are divided into two tables for ease of visualisation, conveying three languages each. Table 15 summarises results for the Dutch, French and German corpora, and Table 16 concerns Italian, Portuguese and Spanish. In all cases, we report mean accuracy scores as obtained by the EACH-USP baseline and by the ap.prob and ap.label models using gender classifiers (recall that gender is the only kind of author demographics available from this domain, cf. Section 4.1.) Best results for each language and candidate set (conveying from 2 to 20 authors each) are highlighted. All differences between the baseline and the proposed models are significant ( $ p < 0.0001 $ ).
Results for the Twitter domain suggest that, once again, using author profiling (gender) classifiers helps authorship attribution. With the exception of the Dutch corpus (for which both gender labels and probabilities produced similar mean results), the use of gender probabilities as predicted by ap.prob is slightly superior to using gender labels as predicted by ap.label.
6.3 Discussion
Although results in the previous section vary considerably across domains, tasks and languages, the use of author profiling predictions was found to consistently improve mean accuracy in the authorship attribution task in all corpora and in all settings under consideration, comprising 1220 randomised evaluation scenarios in total. In other words, using author demographics predictions always leads to a certain improvement over the standard method. This, in our view, supports the main research hypothesis of the present study.
Differences across experiments may have been influenced by multiple and possibly intertwined factors. First, there is the issue of author profiling accuracy. Some tasks (or some domains) may be simply more challenging than others, and this may explain, for instance, why the gain perceived by our method in the blog domain is consistently smaller than in the other text genres.
Second, as pointed out in Section 5.1, we notice that using author profiling predictions may be more helpful when the test set is balanced according to the predicted class. Related to this, class distribution may also explain why binary classifiers were generally less helpful than ternary classifiers: from an author identification perspective, having a set of candidate authors split into three classes may be simply more effective than having the same set split into two classes.
Differences between ap.prob and ap.label are generally small, but using author profiling label predictions in ap.label is still superior to using probabilities in most scenarios. The main exception is the Twitter domain, in which there is a certain advantage for ap.prob.
6.4 Using multiple author profiling classifiers simultaneously
Given that the use of individual author profiling classifiers always increases overall accuracy in the authorship attribution task, we may ask whether using multiple classifiers simultaneously may do even better. To shed light on this issue, we carried out a series of complementary experiments in which every possible ensemble combination of classifiers is attempted in every domain except for Twitter, which supports one single (i.e., gender) author profiling class.
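The set of combinations to be tested can be enumerated as in the sketch below (the function name and the short classifier names are hypothetical).

```python
from itertools import combinations

def classifier_combinations(classifiers, min_size=2):
    """All subsets of the available classifiers with at least two members."""
    return [
        combo
        for size in range(min_size, len(classifiers) + 1)
        for combo in combinations(classifiers, size)
    ]

print(classifier_combinations(["age", "gender"]))
# [('age', 'gender')]
print(len(classifier_combinations(["age", "gender", "it"])))
# 4
```

Two classifiers yield a single combination and three classifiers yield four (three pairs and one triple), whereas six classifiers would already yield 57, which grows quickly as more classifiers become available.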
6.4.1 Blog domain
Results for the Blog domain are summarised in Table 17, in which individual classifier results (top rows) are reproduced from the previous sections for ease of comparison with the combined alternatives (bottom). The overall best alternative (C1) is highlighted.
From these results, we notice that using age and gender classifiers simultaneously improves overall results, and particularly so in the case of the ap.label strategy.
6.4.2 Facebook domain
Results for the Facebook domain are summarised in Table 18, once again showing the comparison between individual classifier results (top rows) and those obtained by their combinations (bottom). The overall best alternative (C4) is highlighted.
Results from Table 18 suggest that using all three classifiers simultaneously (C4) is considerably superior to the use of any individual classifier alone (on the top rows), or their other possible combinations (C1–C3). However, this outcome should be interpreted carefully given that stacking an arbitrarily large number of classifiers may easily lead to overfitting (Custódio and Paraboni Reference Custódio and Paraboni2021). We notice, for instance, that the small difference between C4 and C3 may suggest that adding the IT background classifier (which has the lowest accuracy among the three individual options) is not necessarily helpful in the present case.
6.4.3 Essay domain
Finally, results for the Essay domain are summarised in Table 19, once again showing both individual classifier results (top) and the combined alternatives (bottom). Given the large number of possible combinations, the present analysis is limited to the best-performing classification pair (which turns out to be Gender + Religiosity), to which we attempted to add a third classifier only. In other words, combinations of four classifiers or more are presently not addressed. The overall best alternative (C1) is highlighted.
For the Essay domain, the best results were obtained by using two classifiers only (i.e., the C1 combination), which was found to consistently outperform all single- and three-classifier alternatives alike, and often by a considerable margin. This advantage, which is particularly striking in the case of the ap.prob strategy, significantly deteriorates with the addition of a third classifier. This outcome, in our view, once again suggests that individual author profiling classifiers need to be added judiciously to the ensemble architecture for optimal results.
6.5 The role of author demographics distribution
The observation that author demographics information may be naturally unbalanced – as it is indeed the case in the present data sets – gives rise to the question of how the use of author profile classifiers may contribute to the overall authorship attribution task in scenarios of different demographics distribution among the candidate authors. For instance, given a candidate set in which all authors belong to the same gender group, we would expect gender classification to be of little or no help, and its contribution to be presumably greater in a more gender-balanced set.
To illustrate the role of author demographics distribution in our current authorship attribution approach, we carried out a complementary analysis in situations with and without class balance. However, due to the data sparsity and inherent class imbalance in our data, the discussion that follows is limited to the case of gender classification in the Essay domain.
Table 20 presents authorship attribution results for unbalanced (i.e., single-gender) and balanced (i.e., with the same number of male and female individuals) candidate sets. These results are shown alongside the previously observed results from random distributions, which are presently reproduced from Section 6.2.3 for ease of visualisation.
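Candidate sets of the two kinds might be drawn as in the following sketch. The function name, the `mode` values and the toy author identifiers are hypothetical, and the actual sampling procedure may differ.

```python
import random

def sample_candidates(gender_by_author, size, mode, seed=0):
    """Draw a single-gender or gender-balanced candidate set."""
    rng = random.Random(seed)
    groups = {}
    for author, gender in gender_by_author.items():
        groups.setdefault(gender, []).append(author)

    if mode == "single":
        eligible = sorted(g for g, members in groups.items() if len(members) >= size)
        if not eligible:
            raise ValueError("no gender group is large enough")
        return rng.sample(sorted(groups[rng.choice(eligible)]), size)

    if mode == "balanced":
        if size % len(groups) != 0:
            raise ValueError("size must be divisible by the number of groups")
        per_group = size // len(groups)
        return [a for g in sorted(groups) for a in rng.sample(sorted(groups[g]), per_group)]

    raise ValueError(f"unknown mode: {mode}")

# Toy author pool with ten authors of each gender.
authors = {f"f{i}": "female" for i in range(10)}
authors.update({f"m{i}": "male" for i in range(10)})
print(sample_candidates(authors, 4, "single"))
print(sample_candidates(authors, 4, "balanced"))
```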
Single-gender results (middle columns in Table 20) from both ap.prob and ap.label models are on average inferior to those obtained in random gender distribution (left columns). By contrast, balanced-gender results (right columns) are equal or slightly superior to those observed in random gender distribution. This pattern may also be visualised in Figure 5.
7. Final remarks
The present study has addressed the authorship attribution task of digital texts by extending an existing stack ensemble model with author profiling classifiers. This approach has been evaluated in a range of text domains, author profiling tasks and languages and was found to be consistently superior to a standard authorship attribution method that does not have access to author demographics predictions even though author demographics information is naturally imbalanced, and classifiers of this kind are generally suboptimal.
Using author profiling classifiers to aid authorship attribution was found to be particularly useful in the case of ternary classes, which have the effect of splitting the set of candidate authors into smaller subsets, and hence facilitate author identification. Moreover, the method appears to suit the stack authorship attribution strategy even when the accuracy of the author profiling task is relatively low, and it was found to boost overall results even further when multiple classifiers are considered simultaneously. More research is, however, required to establish which classifiers may be combined in this way and how to guarantee optimal results with no risk of overfitting.
The experiments that were carried out give rise to the question of how to explain the present gains over the standard authorship attribution model. In the present two-level architecture (i.e., consisting of author profiling and authorship attribution levels), however, an analysis of this kind would require investigating the possible interactions between domains, languages, profiling tasks, corpus sizes, number of instances, class imbalance, number of candidate authors and others. For that reason, in the present work, we chose to minimise some of these issues by performing multiple random tests, and we leave a more detailed analysis along these lines to future work, which should seek to pinpoint the exact circumstances under which using author profiling predictions may or may not be useful, and also compare the present approach with other authorship attribution methods.
Finally, another important limitation of the present work is that the evaluation focused on candidate sets containing up to 20 authors in each domain. This limitation, which is inherited from the authorship attribution model taken as the basis for our current work (e.g., from PAN-CLEF shared tasks and others), should also be addressed in future work. However, given the strong correlation between input size and authorship attribution accuracy (Rocha et al. Reference Rocha, Scheirer, Forstall, Cavalcante, Theophilo, Shen, Carvalho and Stamatatos2017), this will most likely require additional corpora with larger amounts of text samples per author.
Acknowledgement
The second author has received support from FAPESP grant 2016/14223-0.
Conflicts of interest
The authors declare none.