Automated unsupervised authorship analysis using evidence accumulation clustering

Published online by Cambridge University Press:  21 November 2011

ROBERT LAYTON and PAUL WATTERS
Affiliation:
Internet Commerce Security Laboratory, University of Ballarat, Australia e-mails: r.layton@icsl.com.au, p.watters@ballarat.edu.au
RICHARD DAZELEY
Affiliation:
Data Mining and Informatics Research Group, University of Ballarat, Australia e-mail: r.dazeley@ballarat.edu.au

Abstract

Authorship Analysis aims to extract information about the authorship of documents from features within those documents. Typically, this is performed as a classification task with the aim of identifying the author of a document, given a set of documents of known authorship. Alternatively, unsupervised methods have been developed primarily as visualisation tools to assist the manual discovery of clusters of authorship within a corpus by analysts. However, there is a need in many fields for more sophisticated unsupervised methods to automate the discovery, profiling and organisation of related information through clustering of documents by authorship. An automated and unsupervised methodology for clustering documents by authorship is proposed in this paper. The methodology is named NUANCE, for n-gram Unsupervised Automated Natural Cluster Ensemble. Testing indicates that the derived clusters have a strong correlation to the true authorship of unseen documents.
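
The abstract names the ingredients of the approach (character n-grams, an ensemble of clusterings, evidence accumulation) without showing how they fit together. The following is a minimal sketch of that general idea, not the NUANCE implementation itself: it assumes k-means base clusterings over character n-gram frequency profiles, a co-association matrix counting how often two documents land in the same cluster, and a final single-link cut at a fixed threshold. All function names, the choice of trigrams, the number of runs and the 0.5 threshold are hypothetical choices made for illustration.

```python
import random
from collections import Counter
from itertools import combinations

def char_ngram_profile(text, n=3):
    """Relative frequencies of overlapping character n-grams (assumed feature set)."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return {}
    return {g: c / len(grams) for g, c in Counter(grams).items()}

def distance(p, q):
    """Squared Euclidean distance between two sparse profiles."""
    return sum((p.get(g, 0.0) - q.get(g, 0.0)) ** 2 for g in set(p) | set(q))

def mean_profile(profiles):
    """Centroid of a non-empty list of sparse profiles."""
    total = Counter()
    for p in profiles:
        for g, v in p.items():
            total[g] += v
    return {g: v / len(profiles) for g, v in total.items()}

def kmeans_labels(profiles, k, rng, iterations=10):
    """One plain k-means run; returns a cluster label per document."""
    centroids = rng.sample(profiles, k)
    labels = [0] * len(profiles)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: distance(p, centroids[c]))
                  for p in profiles]
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = mean_profile(members)
    return labels

def co_association(profiles, runs=30, k_range=(2, 4), seed=0):
    """Fraction of runs in which each pair of documents shares a cluster."""
    rng = random.Random(seed)
    n = len(profiles)
    counts = [[0] * n for _ in range(n)]
    for _ in range(runs):
        k = rng.randint(k_range[0], min(k_range[1], n))
        labels = kmeans_labels(profiles, k, rng)
        for i in range(n):
            counts[i][i] += 1
        for i, j in combinations(range(n), 2):
            if labels[i] == labels[j]:
                counts[i][j] += 1
                counts[j][i] += 1
    return [[c / runs for c in row] for row in counts]

def consensus_clusters(coassoc, threshold=0.5):
    """Single-link cut: join documents whose co-association exceeds the threshold."""
    n = len(coassoc)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if coassoc[i][j] > threshold:
            parent[find(i)] = find(j)
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]
```

Given a list of document strings `docs`, a call such as `consensus_clusters(co_association([char_ngram_profile(d) for d in docs]))` returns one cluster label per document. No author labels are supplied at any point, which is what makes the procedure unsupervised; how the number of clusters is chosen automatically in the published method is not stated in this abstract.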

Type
Articles
Copyright
Copyright © Cambridge University Press 2011
