


Tightening the net: a review of current and next generation spam filtering tools

James Carpinter & Ray Hunt ∗
Department of Computer Science and Software Engineering
University of Canterbury

∗ email: ray.hunt@canterbury.ac.nz

Abstract

This paper provides an overview of current and potential future spam filtering approaches. We examine the problems spam introduces, what spam is and how we can measure it. The paper primarily focuses on automated, non-interactive filters, with a broad review ranging from commercial implementations to ideas confined to current research papers. Both machine learning and non-machine learning based filters are reviewed as potential solutions and a taxonomy of known approaches is presented. While a range of different techniques have been and continue to be evaluated in academic research, heuristic and Bayesian filtering dominate commercial filtering systems; therefore, a case study of these techniques is presented to demonstrate and evaluate their effectiveness.

Keywords: spam, ham, heuristics, machine learning, non-machine learning, Bayesian filtering, blacklisting.

1 Introduction

The first message recognised as spam was sent to the users of Arpanet in 1978 and represented little more than an annoyance. Today, email is a fundamental tool for business communication and modern life, and spam represents a serious threat to user productivity and IT infrastructure worldwide. While it is difficult to quantify the level of spam currently sent, many reports suggest it represents substantially more than half of all email sent and predict further growth for the foreseeable future [18, 43, 30].

For some, spam represents a minor irritant; for others, a major threat to productivity. According to a recent study by Stanford University [36], the average Internet user loses ten working days each year dealing with incoming spam. Costs beyond those incurred sorting legitimate email from spam are also present: 15% of all email contains some type of virus payload, and one in 3,418 emails contained pornographic images particularly harmful to minors [54]. It is difficult to estimate the ultimate dollar cost of such expenses; however, most estimates place the worldwide cost of spam in 2005, in terms of lost productivity and IT infrastructure investment, to be well over US$10 billion [29, 52].

The magnitude of the problem has introduced a new dimension to the use of email: the spam filter. Such systems can be expensive to deploy and maintain, placing a further strain on IT budgets. While the reduced flow of spam email into a user’s inbox is generally welcomed, the existence of false positives often necessitates the user manually double-checking filtered messages; this reality somewhat counteracts the assistance the filter delivers. The effectiveness of spam filters in improving user productivity is ultimately limited by the extent to which users must manually review filtered messages for false positives.

Unfortunately, the underlying business model of bulk emailers (spammers) is simply too attractive. Commissions to spammers of 25–50% on products sold are not unusual [30]. On a collection of 200 million email addresses, a response rate of 0.001% (2,000 sales) would yield a spammer a return of $25,000, given a $50 product and a commission at the lower end of that range (25%). Any solution to this problem must reduce the profitability of the underlying business model, by either substantially reducing the number of emails reaching valid recipients, or increasing the expenses faced by the spammer.

Regrettably, no solution has yet been found to this vexing problem. The classification task is complex and constantly changing. Constructing a single model to classify the broad range of spam types is difficult; this task is made near impossible with the realisation that spam types are constantly moving and evolving. Furthermore, most users find false positives unacceptable. The active evolution of spam can be partially attributed to changing tastes and trends in the marketplace; however, spammers often actively tailor their messages to avoid detection, adding a further impediment to accurate detection.

The similarities between junk postal mail and spam can be immediately recognised; however, the nature of the Internet has allowed spam to grow uncontrollably. Spam can be sent with no cost to the sender: the economic realities that regulate junk postal mail do not apply to the Internet. Furthermore, the legal remedies that can be taken against spammers are limited: it is not difficult to avoid leaving a trace, and spammers easily operate outside the jurisdiction of those countries with anti-spam legislation.

The remainder of this section provides supporting material on the topic of spam. Section 2 provides an overview of spam classification techniques. Sections 3.1 and 3.2 provide a more detailed discussion of some of the spam filtering techniques known: given the rapidly evolving nature of this field, it should be considered a snapshot of the critical areas of current research. Section 4 details the evaluation of spam filters, including a case study of the PreciseMail Anti-Spam system operating at the University of Canterbury. Section 5 finishes the paper with some conclusions on the state of this research area.

1.1 Definition

Spam is briefly defined by the TREC 2005 Spam Track as “unsolicited, unwanted email that was sent indiscriminately, directly or indirectly, by a sender having no current relationship with the recipient” [12]. The key elements of this definition are expanded on in a more extensive definition provided by Mail Abuse Prevention Systems [35], which specifies three requirements for a message to be classified as spam. Firstly, the message must be equally applicable to many other potential recipients (i.e. the identity of the recipient and the context of the message are irrelevant). Secondly, the recipient has not granted ‘deliberated, explicit and still-revocable permission for it to be sent’. Finally, the communication of the message gives a ‘disproportionate benefit’ to the sender, as solely determined by the recipient. Critically, they note that simple personalisation does not make the identity of the sender relevant and that failure by the user to explicitly opt out during a registration process does not form consent.

Both these definitions identify the predominant characteristic of spam email: that a user receives unsolicited email that has been sent without any concern for their identity.

1.2 Solution strategies

Proposed solutions to spam can be separated into three broad categories: legislation, protocol change and filtering.

A number of governments have enacted legislation prohibiting the sending of spam email, including the USA (Can Spam Act 2004) and the EU (directive 2002/58/EC). American legislation requires an ‘opt-out’ list that
