Toward Automated E-cigarette Surveillance: Spotting E-cigarette Proponents on Twitter

Ramakanth Kavuluru∗† and AKM Sabbir†
∗ Division of Biomedical Informatics, Dept. of Biostatistics, University of Kentucky, Lexington, KY
† Department of Computer Science, University of Kentucky, Lexington, KY
{ramakanth.kavuluru, akm.sabbir}@uky.edu

Abstract: Electronic cigarettes (e-cigarettes or e-cigs) are a popular emerging tobacco product. Because e-cigs do not generate toxic tobacco combustion products that result from smoking regular cigarettes, they are sometimes perceived and promoted as a less harmful alternative to smoking and also as a means to quit smoking. However, the safety of e-cigs and their efficacy in supporting smoking cessation are yet to be determined. Importantly, the Food and Drug Administration (FDA) currently does not regulate e-cigs and as such their manufacturing, marketing, and sale are not subject to the rules that apply to traditional cigarettes. A number of manufacturers, advocates, and e-cig users are promoting e-cigs on Twitter and according to a recent estimate obtained through Twitter Inc., there are about 31,000 e-cig related tweets per day. In this paper we develop a high accuracy supervised predictive model (precision 97%, recall 86%) to automatically identify e-cig “proponents” on Twitter. Analyzing their corresponding tweets from a large corpus, we find that as opposed to regular tweeters that form over 90% of the dataset, e-cig proponents are a smaller subset but tweet two to five times more than regular tweeters and also disproportionately highlight e-cig flavors, their smoke-free and potential harm reduction aspects, and their claimed use in smoking cessation. Given that the FDA is currently in the process of proposing meaningful regulation, we believe our work demonstrates the strong potential of machine learning approaches for automated e-cig surveillance on Twitter.

I. INTRODUCTION

Electronic cigarettes (e-cigarettes or simply e-cigs) were introduced in the United States (US) in 2007 [1] and are currently a popular emerging tobacco product across the world. An e-cig essentially consists of a battery that heats up liquid nicotine available in a cartridge into a vapor that is inhaled by the user [2]. E-cig users are termed vapers and the process of using an e-cig is called vaping. E-cigs are similar to conventional tobacco cigarettes with regard to visual, sensory, and behavioral aspects and hence were observed to reduce craving [3]. Owing to their recent introduction, there are very few studies on e-cig safety, risk of abuse, and their efficacy as a smoking cessation aid, especially regarding the effects of long term use. In fact, currently the search phrase electronic nicotine delivery systems OR e-cigarette OR electronic cigarette yields 553 articles in the PubMed search system, out of which 475 (or 86%) had dates of publication in 2014 or 2015.

Because e-cigs do not generate toxic combustion products that are produced with tobacco cigarettes, they are perceived and also sometimes marketed as suitable alternatives for smoking cessation [4]. However, scientific research to verify these claims is limited and is often inconclusive. On one hand there are studies that indicate comparable or superior effectiveness of e-cigs in smoking cessation [5], [6]. However, there are also results [7], [8] that show no such associations exist between e-cig use and quitting or reduced conventional cigarette consumption. Another recent effort [9] also indicates that passive exposure to e-cigs increases the desire to smoke both regular cigarettes and e-cigs. Nevertheless, current research seems to indicate that they are less harmful than traditional cigarettes [10]. The ongoing healthy scientific debate around e-cigs is welcome, especially for regular smokers who are interested in quitting or adopting less harmful alternatives.
However, lack of FDA regulation (except for therapeutic use) has heavily increased marketing of e-cigs on the Web [11] and through television ads [12] even if individual states have recently started to enact their own regulations to limit sales, marketing, and use [13]. According to a 2013 Centers for Disease Control and Prevention (CDC) report [14], e-cig consumption doubled in middle and high school students from 2011 to 2012. Furthermore, 9.3% of middle and high school ever e-cig users in 2012 have never smoked conventional cigarettes. Alarmingly, this percentage goes up to 20.3% when considering only middle school students. A more recent CDC report [15] shows that e-cig use tripled from 2013 to 2014 among middle and high school students. Since long term safety of e-cigs has not been thoroughly studied yet, the prospect of adolescents developing nicotine dependence could be detrimental to public health in future generations. When considering adult smokers, however, the significant increase in e-cig awareness has reduced their perception of e-cigs as being less harmful compared with regular cigarettes [16]. Since Web based advertising and discussion still plays a major role in e-cig marketing and use and given one in four online US teenagers uses Twitter [17], we believe it is critical to study the landscape of e-cig messages and their authors on Twitter. Although e-cig message themes and author classification might be highly granular, in this pilot project we take a simpler approach to tweet author classification – each tweeter is either a “proponent” or not for our purposes. Proponents are tweeters who represent e-cig sales or marketing agencies, individuals who advocate e-cigs, or tweeters who specifically identify themselves as vapers in their profile bio. Essentially these tweeters are generally more inclined to support e-cigs regardless of their specific motivation (e.g., business, lobbying, smoking cessation). 
In this paper, based on a hand-labeled dataset of 1000 tweeter profiles, we build machine learned models to automatically identify proponents. We subsequently use this model to analyze the content of tweets generated by proponents in comparison with other tweeters along several well known e-cig themes (e-cig flavors, harm reduction, smoke-free aspect, and smoking cessation) using straightforward text processing. We demonstrate that proponents are many times as likely to highlight the attractive (and sometimes scientifically not yet verified) aspects of e-cigs compared with regular tweeters. To our knowledge this is the first attempt at identifying proponents and a first step in building a framework for automatic surveillance of e-cig related chatter.

II. BACKGROUND AND RELATED WORK

Since its introduction in 2006, Twitter has grown into one of the top 15 visited websites [18] in the world with 100 million daily active users who generate over 500 million tweets per day [19]. The asymmetric network structure of Twitter inherently supports information diffusion and given that a recent study [20] reveals that over 95% of Twitter profiles are public, mining tweets is a practical tool to measure user engagement with various events and products. Since users are not required to publicly declare personal information, several recent studies have focused on identifying user demographic attributes such as age groups and life stages [21], gender [22], and race and ethnicity [23]. In the context of public health, Twitter based automatic syndromic surveillance has been shown to have high correlation with traditional surveillance methods [24], [25] with the added advantage of near real time access to trends, especially in the early epidemic stages [26], [27]. Recent efforts also noted Twitter’s suitability for promoting health literacy [28], encouraging fitness activity [29], and monitoring drug safety [30].

In the context of tobacco control advocacy, researchers found significant reach through Twitter in obtaining signatures for an online petition to drop tobacco sponsorship for an international music concert in Indonesia [31]. Another recent Twitter based study focused on emerging tobacco products [32] found high prevalence of positive sentiment for hookah and e-cig. It also successfully demonstrated the application of machine learning methods in automatically identifying tobacco related tweets, constituent themes, and sentiments.

The most relevant effort in the context of our paper is from Huang et al. [33] who automatically identify “commercial” tweets from a corpus of nearly 73,000 tweets collected in the months of May and June in 2012. For their purposes tweets that contain links to sales websites and promotional messages are all commercial regardless of who posts them (regular tweeters vs e-cig marketers/advocates). They use DiscoverText, a cloud based commercial text analytics software program, to semi-automatically (Naive Bayes with additional heuristics) classify tweets as commercial or not. They report that 90% of their tweets are commercial and nearly 10% of such tweets mention smoking cessation. In our effort, instead of identifying commercial tweets, we identify e-cig proponent Twitter profiles using supervised machine learning. We believe this is a more direct approach that aids in electronic surveillance efforts needed in the immediate future to monitor e-cig marketing/publicity practices and as such complements Huang et al.’s effort. Furthermore, compared with 2012, there is an order of magnitude increase in the number of e-cig tweets (based on an official quote from Twitter Inc.) and our experiments are conducted on a corpus of over one million e-cig related tweets. After identifying proponent profiles, we conducted analyses along well known e-cig themes to see differences between tweeting behaviors of proponents and regular tweeters.

III. DATASETS AND ANNOTATION

We used two different datasets of e-cig related tweets: the first set of 224,000 tweets was obtained using the rate limited Twitter streaming API during the months of September to December 2013, and the second set of nearly one million tweets (purchased from the exhaustive Twitter firehose) covers the month of March 2015. Both contain tweets that match the query terms: electronic-cigarette, e-cig, e-cigarette, e-juice, e-liquid, vape-juice, and vape-liquid. Variants of these terms with spaces instead of hyphens or just without the hyphens (for matching hashtags) were also used in the query. These terms were chosen in consultation with faculty members in the College of Nursing at the University of Kentucky (UKY) who work on e-cig policy research. They are specific enough and empirically shown to result in a 99% match to actual e-cig related tweets [33]. The juice/liquid terms represent the liquid nicotine cartridges that need to be refilled for the vaping devices. The smaller dataset represents a free sample curated during a four month period and the second larger dataset constitutes a full dataset from one month. Thus, we get a recent estimate of around 31,000 e-cig tweets per day (as of March 2015) compared with about 1200 such tweets per day in mid 2012 [33], indicating a 25 fold increase.

Our supervised prediction of proponents on Twitter relies on each tweeter’s Twitter username, profile bio/description, and most recent tweets. The username or handle is unique to each tweeter and can be up to 15 characters in length. Optionally, each tweeter can also choose to write a profile bio of at most 160 characters to characterize his/her persona on Twitter. If a tweeter never tweets about e-cigs (matches none of our query terms) during a period of surveillance, we automatically assume that they are not a proponent. Thus only those tweeters who have authored at least one e-cig tweet are considered as candidates for classification.

| Feature description | Presence in best combination |
|---|---|
| Unigrams and bigrams from bio and recent tweets | ✓ |
| Part of speech tags of bio and recent tweets | ✓ |
| Average positive and negative polarity scores of the bio and recent tweets | ✓ |
| Topic distributions of bio and recent tweets: based on 10 topics for bios and 20 topics for recent tweets generated using LDA based on training dataset profiles | ✓ |
| Presence of user mentions, URLs, and punctuation marks in the bio | ✗ |
| Length of the bio and average length of recent tweets; also binarized versions of these features: whether the lengths are above the averages determined on training data | ✗ |
| Presence of the following terms as substrings in user name: vape, vapor, vapour, vaping, ecig, eliquid, ejuice (e.g., @ecighunter, @askavaper, @vapeclub) | ✓ |

TABLE I: Feature groups explored for predicting e-cig proponents

Due to reasons that we elaborate later, we classify profiles with empty bios differently compared with those that have non-empty profile descriptions. Based on a sample of 34,000 unique users with at least one e-cig tweet during our four month sample collection, we determine that approximately 20% of e-cig tweeters have empty profile descriptions where the tweeters choose not to provide a bio. Our supervised approach is designed for the 80% of e-cig tweeters with non-empty bios. We use a simple unsupervised approach for tweeters with empty bios. Next, we outline the training dataset creation for classifying tweeters with non-empty bios.
We randomly chose 1000 tweeter profiles with non-empty bios from our corpus of e-cig tweets. Two annotators independently annotated each of those profiles with a positive (proponent) label if the bio indicates that they are in e-cig sales/marketing or if they are advocates or vapers. This seemed reasonable given our manual overview of the dataset indicated that bio text is often a strong indicator of tweeter perception of e-cigs. For example, the following are a few real bios from our dataset.

• “a vaper trying to help other find that sweet state of vape.”
• “I love e-cigs. If anyone wants to buy, I can hook you up.”
• “a Finnish vaping advocate who semi occasionally enjoys rambling on youtube”
• “Proud vaper and constitutionalist. I support vaping and the constitution as is was written.”
• “Dedicated to bringing the best deals on vaping needs.”
• “Manufacturer and distributor of American made premium vape Juice. We also sell e-cigarettes and supplies from around the world.”

If the bio does not give enough evidence to make a conclusive decision, the annotators looked at the recent 200 tweets generated from that profile to see if they are predominantly favoring e-cigs. We note that tweeters who retweet e-cig favoring tweets are also considered proponents even if they are not original authors of those tweets. We reiterate that for our purposes proponents are tweeters who are generally more inclined to support e-cigs regardless of their specific motivation (e.g., business, lobbying, smoking cessation). We obtained substantial inter annotator agreement (κ = 0.88) and after ignoring the 43 profiles where the annotators disagreed, we have 957 profiles in the final dataset with 216 proponents and 741 in the negative class.

IV. PROPONENT CLASSIFICATION MODEL

Given the short size of the profile bio we just used the logistic regression (LR) classifier in the Python Scikit-Learn [34] machine learning library.
As indicated in Section III, we use features extracted from the username, bio, and recent tweets to build the classification model. We split our dataset into 70% training and 30% test splits with stratified sampling of both classes. Using average four fold cross-validation F-scores computed over 200 distinct shuffles on the training dataset, we identified the best feature combination from all the feature types we experimented with as shown in Table I. Unigrams and bigrams are traditional n-gram features typically used in text classification. Besides these, we also use parts of speech of different tokens in both bio and recent tweets’ text. In addition, we used sentiment score features from bio and recent tweets. For a given bio (or a set of recent tweets taken as a single text blob), our features are the average positive and negative scores of the text computed over all sentiment words in it using the scores available in SentiWordNet 3.0 [35]. We also included topic modeling based features using bios and recent tweets. The central idea is to apply latent Dirichlet allocation [36] modeling using the MALLET [37] toolkit to the bio text from all profiles in the training dataset and at test time “fold in” a new bio into the model to infer probabilities $P(t_i \mid b)$ of the topic $t_i$ given the new bio $b$. Based on experiments we determined ten is the ideal number of topics for the bios and twenty topics for modeling the recent tweets. This means that for each new bio $b$, we will have as features $P(t_1 \mid b), \ldots, P(t_{10} \mid b)$. Similarly, for each set of recent tweets we have twenty features based on topic modeling. We also used binary features to incorporate presence of user mentions, URLs, and certain punctuation marks (e.g., !, ?) in the bio. However, these features did not end up in the best feature combination.
Similarly, length of the bio and average length of the recent tweets also did not make it into the final model based on cross validation experiments. Another important feature is the presence of e-cig related keywords appearing as substrings of the tweeter user name as shown in the final row of Table I.

A key finding in our initial experiments was that simply using recent tweets based and profile bio based features together in the same model is not very helpful since crucial predictive signal in the bio text (being short, about 160 characters) was getting drowned out by features from the tweeter’s recent tweets even when we considered only 10 or 20 recent tweets in our experiments. To counter this we trained two separate LR models: one based on just the bio and username based features (say, $M^b$) and the second one based only on recent tweets’ features (say, $M^r$). The final prediction model $M$ is the weighted average of positive class probability estimates output by both models. That is, $P_M = \alpha P_{M^b} + (1 - \alpha) P_{M^r}$, where $\alpha \in [0, 1]$. We determined that $\alpha = 0.85$ and the recent 20 tweets to be the best configuration for this weighted averaging approach by maximizing cross validation F-scores on the training data. Specifically, to find the optimum $\alpha$ we did a simple grid search with 0.1 increments starting with $\alpha = 0.1$ and noticed a maximum F-score at a value of 0.8 for $\alpha$ with dips on either side at 0.7 and 0.9. Then we conducted another grid search experiment with 0.01 increments in the range $0.7 < \alpha < 0.9$ and found a new maximum F-score value at $\alpha = 0.85$. Thus we use $\alpha = 0.85$ in all our experiments in the rest of the paper. Intuitively, this means that profile bio is nearly six times more important than recent tweets in spotting proponents, which is not surprising given our single model with both feature types resulted in very poor performance.
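The weighted averaging and the two-stage grid search over α can be illustrated with a minimal sketch. It assumes the two positive class probabilities are already available as plain lists, and it scores each α on a single labeled set rather than by cross validation as in our experiments. The helper names (`combine`, `f_score`, `best_alpha`) and the 0.5 decision threshold are assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple


def combine(p_bio: float, p_recent: float, alpha: float = 0.85) -> float:
    """Weighted average of the two positive-class probabilities."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * p_bio + (1.0 - alpha) * p_recent


def f_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F-score of the positive class (label 1); 0.0 if there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_alpha(p_bio: Sequence[float], p_recent: Sequence[float],
               y_true: Sequence[int], alphas: Sequence[float],
               threshold: float = 0.5) -> Tuple[float, float]:
    """Return (alpha, F-score) for the best alpha; the first one wins ties."""
    best = None
    for a in alphas:
        preds = [1 if combine(b, r, a) >= threshold else 0
                 for b, r in zip(p_bio, p_recent)]
        score = f_score(y_true, preds)
        if best is None or score > best[1]:
            best = (a, score)
    return best


# Coarse grid 0.1, 0.2, ..., 0.9, then a fine grid strictly inside (0.7, 0.9).
coarse: List[float] = [round(0.1 * k, 1) for k in range(1, 10)]
fine: List[float] = [round(0.70 + 0.01 * k, 2) for k in range(1, 20)]
```

In practice `best_alpha` would be called first with `coarse` and then with `fine`, with the F-score replaced by the cross validated average.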
Using the best feature combination as represented in Table I and the weighted model averaging approach with α = 0.85, we trained a model on the training data and tested on the 30% test set to obtain an F-score of 0.896 (with precision 0.96 and recall 0.84). Since this is the score on a single run, we considered 500 distinct shuffles of our full dataset. For each shuffle, we used stratified sampling (maintaining class proportions) to split it into 80%-20% train-test sets and using the best combination, trained on the 80% set and tested on the 20% set. Using this approach and descriptive statistics, we obtain a mean F-score 0.9152 (0.9147 – 0.9157), mean precision 0.9721 (0.9716 – 0.9726), and mean recall 0.8663 (0.8656 – 0.8671) with 95% confidence intervals shown in parentheses.

Using 500 distinct shuffles of the dataset, we ran feature ablation experiments where we removed one feature at a time to measure the performance drop incurred, which indicates the contribution of that feature to the overall model. From ablation results shown in Table II we see that dropping recent tweet features causes the biggest drop in recall and dropping the username substring feature (last row of Table I) causes the biggest drop in precision and F-score. POS tag ablation causes a small drop in performance compared with polarity score and topic distribution score removal. This is not surprising because the tweets and bios of proponents seemed to largely revolve around e-cig themes while regular tweeters discuss varied topics and are not focused on e-cigs. For all features, the drop in recall is always larger than the drop in precision.
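As an illustration of how a mean and a 95% confidence interval can be derived from scores over many shuffles, the following sketch uses a normal approximation (mean ± 1.96 standard errors). This particular interval formula is an assumption for illustration, as is the helper name `mean_ci95`, and the scores in the usage line are made up rather than taken from our runs.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_ci95(scores: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) using a normal-approximation 95% interval."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores")
    mean = statistics.fmean(scores)
    # Standard error of the mean from the sample standard deviation.
    half_width = 1.96 * statistics.stdev(scores) / math.sqrt(n)
    return mean, mean - half_width, mean + half_width


# Hypothetical F-scores from five shuffles (illustrative values only).
mean, lower, upper = mean_ci95([0.90, 0.92, 0.91, 0.93, 0.89])
```

Because the half width shrinks with the square root of the number of shuffles, a large number of shuffles such as 500 gives a narrow interval even when individual runs vary noticeably.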
| Model | Precision | Recall | F-score |
|---|---|---|---|
| Full model | 0.9721 | 0.8663 | 0.9152 |
| – POS tags | 0.9596 | 0.8458 | 0.8979 |
| – Polarity scores | 0.9361 | 0.7838 | 0.8514 |
| – Topic scores | 0.9342 | 0.7869 | 0.8524 |
| – Username | 0.8900 | 0.7802 | 0.8295 |
| – Recent tweets | 0.9229 | 0.7733 | 0.8394 |

TABLE II: Feature ablation results with averages computed over 500 distinct shuffles of the dataset

Overall our efforts have resulted in a very high precision model with reasonable recall for profiles with non-empty bios. Specifically, the feature that incorporates e-cig related terms as substrings of tweeter username turns out to be a powerful predictor for spotting proponents. We use this specific feature in an unsupervised fashion to identify proponents for tweeters with empty bios (about a fifth of all e-cig tweeters). That is, for profiles with empty bios we simply see if certain e-cig related terms (last row of Table I) are in the username and if they are present, we classify corresponding tweeters as proponents. A manual examination of 500 profiles with user names that have these terms as substrings reveals a 99% precision of this approach in identifying proponents for tweeters with empty bios. However, at this point, we don’t have a methodical way to do a supervised approach for identifying proponents with empty bios that do not fit this simple criterion. Still, most users whose bios are empty and do not match this user name based rule appear to be regular non-proponent tweeters based on a manual examination of a sample of such profiles.

V. ANALYSIS OF PROPONENT TWEETS

In this section, we analyze the tweets generated by proponents and other tweeters along familiar e-cig themes. Before we proceed, we introduce a slight modification to the way we apply the model built in Section IV.
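The user name rule from Section IV is simple enough to sketch directly. The term list is the one in the last row of Table I; the function name `username_is_proponent`, the stripping of a leading `@`, and the case-insensitive comparison are assumptions made for illustration.

```python
# Terms from the last row of Table I.
ECIG_TERMS = ("vape", "vapor", "vapour", "vaping", "ecig", "eliquid", "ejuice")


def username_is_proponent(username: str) -> bool:
    """True if any e-cig term occurs as a substring of the user name."""
    # Assumption: ignore a leading '@' and compare case-insensitively.
    name = username.lstrip("@").lower()
    return any(term in name for term in ECIG_TERMS)


# The example handles from Table I all match the rule.
examples = ["@ecighunter", "@askavaper", "@vapeclub"]
matches = [username_is_proponent(u) for u in examples]
```

Note that `@askavaper` matches because `vape` is a substring of `vaper`, which is why plain substring matching is used instead of whole-word matching.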
We analyzed the confusion matrix of the test set predictions of our model and noticed that there were several false negatives which could have been captured by applying the simple user name substring match approach used for empty profiles (last row of Table I). Given this particular substring based classification yields near perfect precision (see Section IV), for the rest of the paper we apply our model to non-empty profiles that do not match the user name substring match. However, we note that this user name based identification approach is not comprehensive because proponents do not always use such user names. In our experiments with two datasets in the rest of the section, our full model yields 25-35% (depending on the dataset used: sample vs exhaustive) more proponents compared with the simpler user name based identification.

| Dataset | Tweeters: total | Tweeters: proponents | Tweets: total | Tweets: proponents |
|---|---|---|---|---|
| 2013 Sep-Dec sample | 34,000 | 2540 (7.5%) | 224,000 | 32,682 (14.6%) |
| 2015 March subset | 100,000 | 4359 (4.3%) | 349,401 | 72,384 (20.7%) |

TABLE III: Proponent and corresponding tweet counts in both datasets

| E-cig theme | 2013 sample: total | 2013 sample: by props (+RTs) | 2013 sample: rate ratio (+RTs) | 2015 subset: total | 2015 subset: by props (+RTs) | 2015 subset: rate ratio (+RTs) |
|---|---|---|---|---|---|---|
| Flavors | 4018 | 2258 (3175) | 15 (46) | 10,855 | 5207 (6400) | 20 (31) |
| Harm reduction | 374 | 193 (212) | 13 (16) | 1527 | 1118 (1227) | 60 (91) |
| Smoke-free aspect | 1902 | 1033 (1350) | 14 (30) | 11,220 | 7380 (7657) | 42 (47) |
| Smoking cessation | 5228 | 1923 (2991) | 7 (16) | 5863 | 3820 (4532) | 41 (75) |

TABLE IV: Thematic distribution of e-cig tweets for proponents vs other tweeters (2013 Sep-Dec sample tweets and 2015 March subset tweets)

To analyze the tweets from proponents compared with the rest of the users, we apply our model to the tweeters in two different datasets introduced in Section III. The first dataset is a free rate limited sample from a four month period (September to December of 2013) and the second dataset is an exhaustive dataset of nearly one million tweets for the month of March 2015.
This newer one month dataset has nearly 360,000 tweeters. Given our model depends also on recent tweets and Twitter imposes prohibitive rate limits for collecting recent tweets in a timely fashion, we chose to look at 100,000 randomly chosen tweeters and their tweets (nearly 350,000) in the new dataset. The old free API based dataset has 34,000 tweeters and 224,000 tweets in the sample generated by them. Although we consider a subset of the tweeters in the new dataset, we believe this captures a different perspective given we are more likely to hit frequent tweeters with the rate limited free sampling approach but are likely to incorporate more infrequent tweeters with the selected subset from the one month dataset.

After applying our classification model to these tweeters in both datasets, we obtain results as shown in Table III. We notice that the percentage of proponents is about 3 percentage points higher in the older sample compared with the subset of the exhaustive one month dataset. This is not surprising given the subset of the exhaustive sample is more likely to capture tweets from infrequent tweeters who are occasionally tweeting about e-cigs. The final two columns of Table III show the total number of tweets and corresponding sizes of the subsets generated by the proponents. From the first row, for the rate limited sample from 2013, we notice that on average a proponent generates twice as many tweets as other tweeters. Based on the second row we notice that in the 2015 sample, on average proponents tweet more than five times as much as other tweeters. We compute this by simply comparing the average number of tweets by proponents (72384/4359 = 16.6) with those by regular tweeters: (349401 − 72384)/(100000 − 4359) = 2.9. We indicated in Section III that the number of tweets per day on e-cigs has increased 25 times compared with the rate in 2012. Here we also observed that the tweets by proponents are also increasing considerably compared with those by regular tweeters.
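The per-tweeter averages above can be reproduced from the counts in Table III with a few lines of arithmetic. The sketch below is only a worked check; the function name `tweets_per_tweeter` is an assumption, and the 2013 totals (34,000 tweeters and 224,000 tweets) are the rounded figures reported in the table.

```python
from typing import Tuple


def tweets_per_tweeter(total_tweets: int, total_tweeters: int,
                       prop_tweets: int, prop_tweeters: int) -> Tuple[float, float, float]:
    """Return (avg tweets per proponent, avg tweets per other tweeter, their ratio)."""
    prop_avg = prop_tweets / prop_tweeters
    other_avg = (total_tweets - prop_tweets) / (total_tweeters - prop_tweeters)
    return prop_avg, other_avg, prop_avg / other_avg


# Counts taken from Table III.
sample_2013 = tweets_per_tweeter(224_000, 34_000, 32_682, 2_540)
subset_2015 = tweets_per_tweeter(349_401, 100_000, 72_384, 4_359)
```

The third element of each tuple is the ratio behind the "twice as many" (2013) and "more than five times" (2015) statements.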
Next we focus on tweet content analysis based on four popular e-cig themes shown in the first column of Table IV. These themes were identified based on consultation with researchers who work on tobacco policy at UKY. Before we get into details of various themes, we briefly describe different elements of Table IV. The ‘total’ column in the table indicates the total number of tweets in the dataset for a specific theme and the ‘by props’ column indicates the number of those tweets arising from proponents identified through our methods in Section IV. The count in the parentheses includes those thematic tweets originally tweeted by the proponents but later retweeted by other tweeters. This is because retweets are counted as separate tweets. So, although these should be counted under the other tweeter group, since the original tweeters are proponents, we can attribute these retweets to the original proponent tweeters instead of the users who retweeted them. As we can see, although proponents form a very small percentage (third column of Table III) of the full set of tweeters, they generate a significant proportion of tweets for many of these popular e-cig themes. To compare the tweeting behavior of proponents with that of others, we use the measure

$$\text{rate ratio} = \frac{\left(\#\,\text{tweets by proponents} \,/\, \#\,\text{total tweets}\right) \,/\, \text{proportion of proponents}}{\left(\#\,\text{tweets by others} \,/\, \#\,\text{total tweets}\right) \,/\, \text{proportion of others}},$$

where all counts are for tweets belonging to a particular theme in the context of Table IV and proportions of proponents/others are over the full dataset (from the third column of Table III). In fact, this formula offers a different but more intuitive way to obtain the same result for all tweets as discussed in the previous paragraph where we compared average tweeting rates of proponents and others.

One of the distinctive features of e-cigs compared with traditional cigarettes is the multitude of flavors available ranging from fruits to desserts.
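The rate ratio defined earlier in this section can be checked against Table IV with a short calculation. The sketch below is illustrative: the function name `rate_ratio` is an assumption, and the proponent proportion is computed from the Table III counts (4359 of 100,000 tweeters) instead of the rounded percentage.

```python
def rate_ratio(theme_total: int, theme_by_props: int, prop_share: float) -> float:
    """Rate ratio of proponents vs others for one theme.

    theme_total    -- number of tweets matching the theme
    theme_by_props -- number of those tweets authored by proponents
    prop_share     -- proportion of proponents among all tweeters (0 < share < 1)
    """
    theme_by_others = theme_total - theme_by_props
    prop_rate = (theme_by_props / theme_total) / prop_share
    other_rate = (theme_by_others / theme_total) / (1.0 - prop_share)
    return prop_rate / other_rate


share_2015 = 4_359 / 100_000

# Harm reduction theme, 2015 March subset (counts from Table IV).
harm_2015 = rate_ratio(1_527, 1_118, share_2015)

# Applied to all tweets, the same formula reduces to the ratio of
# average tweets per proponent to average tweets per other tweeter.
all_2015 = rate_ratio(349_401, 72_384, share_2015)
```

Here `harm_2015` comes out at roughly 60, in line with the corresponding entry of Table IV, and `all_2015` agrees with the ratio of the per-tweeter averages 16.6 and 2.9.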
We went through the websites of three popular e-cig companies (Blu, Njoy, and VaporFi as identified in [33]) and curated a set of 22 popular flavors including menthol, strawberry, blueberry, cola, cherry, and mint in the order of their frequency. We simply searched for these flavor names in our e-cig tweets to obtain our counts shown in Table IV. As we notice from the rate ratio, proponents are 15 times more likely to tweet about e-cig flavors based on the 2013 sample and are 20 times more likely to do that based on recent data. Furthermore, there is a three fold increase in rate ratio (based on the older sample) if we also include others’ flavor tweets, which are retweets of proponent tweets. FDA might regulate the use of flavors as it had done for regular cigarettes and at least as of now it appears that proponents are heavily tweeting the flavor rich aspect of e-cigs. A caveat here is that some of the tweets mentioning e-cigs might also be discussing consuming the various berries as fruits instead of using a flavored e-cig. However, our manual examination of several hundred flavor containing e-cig tweets revealed that this is very rare and only happens with the chocolate flavor given users often mentioned drinking hot chocolate while vaping. Given this disambiguation issue, the chocolate flavor was ignored in our analysis.

Before we move ahead, we note that our thematic tweet identification was based on filtering with Python regular expressions that model lexical constraints. For the sake of clarity, we present more intuitive lexical expressions in the rest of this section. Furthermore, our focus was on precision of identifying tweets belonging to a particular theme and hence lexical expressions seemed more appropriate given the tweets are already related to e-cigs.

Another popular theme in e-cig discussions is perceived harm reduction compared with traditional cigarettes.
Although there might be merit in claims that e-cigs are not as harmful, publicity of this nature might point youth and other nonsmokers to another gateway to form nicotine addiction. Hence this may not be an appropriate way to promote e-cigs in general. We identified e-cig tweets on this topic by searching with the expressions: harm reduction, reduced harm, less harmful, safe[r] than, safe[r] alternative, and healthy/healthier alternative. Note that these searches are limited to our tweets that are already filtered using e-cig related keywords as discussed in Section III. From Table IV, we see that in the recent data sample, proponents are 60 times more likely to discuss this aspect compared with other users. The ratio increased by 60/13 = 4.6 times from late 2013 to early 2015 for this theme.

A major hindrance to using traditional cigarettes is the smoke they generate and the smoking bans in place due to that reason. Second hand smoke related consequences might also lead smokers to reduce their consumption, especially in public places and in the presence of family members. Hence e-cig proponents are more likely to highlight the smoke free aspect of vaping. We applied further filtering on our e-cig tweets using the following expressions: 1. smoke[- ]free, 2. smoke[- ]less, and 3. tokens ‘tobacco’ or ‘smoking’ and the word ‘alternative’ in a tweet. We used an optional hyphen or white space for the first two expressions and the third expression requires the occurrence of either tobacco or smoking in the tweet along with the word alternative. From the third row of Table IV we can see that the rate ratio has increased three times from 2013 to 2015. It is also extremely high, at 42, as of early 2015.

Our final theme is smoking cessation with the aid of e-cigs. Given evidence is still being gathered and clinical studies are being conducted to actually test these claims, it may not be appropriate to publicize e-cigs as means to quit smoking.
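The harm reduction and smoke-free expressions listed in this section could be approximated with Python regular expressions along the following lines. This is a sketch under stated assumptions and not the exact set of patterns used in our experiments: the pattern names, the word boundaries, and the case-insensitive matching are illustrative choices.

```python
import re

# Harm reduction: harm reduction, reduced harm, less harmful,
# safe[r] than, safe[r] alternative, healthy/healthier alternative.
HARM_REDUCTION = re.compile(
    r"\b(?:harm reduction|reduced harm|less harmful"
    r"|safer? (?:than|alternative)|health(?:y|ier) alternative)\b",
    re.IGNORECASE,
)

# Smoke-free: "smoke free"/"smoke less" with an optional hyphen or space.
SMOKE_FREE = re.compile(r"smoke[- ]?(?:free|less)", re.IGNORECASE)
ALTERNATIVE = re.compile(r"\balternative\b", re.IGNORECASE)
TOBACCO_OR_SMOKING = re.compile(r"\b(?:tobacco|smoking)\b", re.IGNORECASE)


def is_harm_reduction(tweet: str) -> bool:
    """True if the tweet matches any harm reduction expression."""
    return HARM_REDUCTION.search(tweet) is not None


def is_smoke_free(tweet: str) -> bool:
    """True for smoke-free/smokeless mentions, or 'alternative' together
    with 'tobacco' or 'smoking' anywhere in the tweet."""
    if SMOKE_FREE.search(tweet):
        return True
    return bool(ALTERNATIVE.search(tweet) and TOBACCO_OR_SMOKING.search(tweet))
```

Both helpers are meant to be applied only to tweets that already matched the e-cig query terms, which is what keeps such short lexical patterns precise.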
To estimate the popularity of this theme, we filter our e-cig tweets based on the following tweet text constraints:
• stop/stopped/stopping smoking
• smoking cessation
• give up, giving up, or given up smoking/tobacco
• quit/quitting and tobacco/smoking in the tweet
• kick/kicked/kicking his/her/their/my/your smoking/tobacco

For this theme, from the last row of Table IV, there is a staggering sixfold (the highest among the four themes) increase in rate ratio compared with the older dataset. Including retweets of tweets by proponents, they are 75 times more likely to discuss smoking cessation compared with regular tweeters. However, we also notice that the absolute volume of cessation related tweets has decreased given that the newer dataset has 125,000 more tweets than the 2013 sample. Looking at the retweet-included counts, we also notice that users are not retweeting such tweets as much in the recent dataset, as researchers and federal agencies have also started awareness campaigns.

Although we have been careful in the paper to convey that the thematic tweets simply match a few specific lexical patterns, we would be remiss if we did not also discuss an important shortcoming of this approach. In our filtering we did not account for negated mentions or, more generally speaking, the polarity of statements that discuss a given theme. For example, our dataset has tweets matching our cessation patterns that mention a research study where e-cigs were not shown to aid in cessation. Similarly, tweets that say e-cigs are not less harmful are also included in the harm reduction theme. In other words, in this work we identified tweets that discuss a theme but not necessarily their polarity, which is a crucial next step in our efforts on this topic. This does not necessarily take away from our analysis because proponents rarely discuss negative aspects of e-cigs, especially with regard to harm reduction and smoking cessation.
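To make the cessation constraints and the polarity shortcoming concrete, here is a small Python sketch. It is an approximation under stated assumptions: the names are hypothetical, matching is case-insensitive, and each constraint is read as literally as possible from the list above.

```python
import re

_TARGET = r"(?:smoking|tobacco)"

# Constraints that are contiguous phrases.
_PHRASES = [
    r"stop(?:ped|ping)? smoking",
    r"smoking cessation",
    r"(?:give|giving|given) up " + _TARGET,
    r"kick(?:ed|ing)? (?:his|her|their|my|your) " + _TARGET,
]
CESSATION_PHRASE_RE = re.compile(
    r"\b(?:" + "|".join(_PHRASES) + r")\b", re.IGNORECASE
)

# Constraint that only requires co-occurrence within the tweet.
QUIT_RE = re.compile(r"\bquit(?:ting)?\b", re.IGNORECASE)
TARGET_RE = re.compile(r"\b" + _TARGET + r"\b", re.IGNORECASE)


def is_cessation(text):
    """True if the tweet satisfies any of the five cessation constraints."""
    if CESSATION_PHRASE_RE.search(text):
        return True
    return bool(QUIT_RE.search(text) and TARGET_RE.search(text))


# A negated statement still matches, which is the polarity limitation.
negated = "Study finds e-cigs do not help people quit smoking"
print(is_cessation(negated))
```

The last two lines show why theme membership should not be read as endorsement: the filter has no notion of negation, so a tweet reporting a negative finding is counted under the cessation theme just like a promotional one.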
However, it would be interesting to understand the polarity of tweets by the others group along these two themes.

VI. CONCLUDING REMARKS

E-cigs are a popular emerging tobacco product currently not regulated by the FDA. As such, their sales and marketing are not subject to the stricter rules typically applied to regular cigarettes although individual states have recently enacted laws to regulate them to some extent. In this paper, to aid automated surveillance of e-cigs on social media, we conducted what we believe is the first study to automatically identify e-cig proponents on Twitter. Using a hand-labeled dataset, we built a classification model with features based on tweeter bio, recent tweet text, and user name. Our model achieves a precision of 97% with recall of 86% and can be used to classify new unseen profiles. We applied our model to two different datasets with complementary characteristics collected in late 2013 and March 2015. Our experiments showed that e-cig proponents on Twitter constitute a very small percentage of the tweeters who write about e-cigs. However, they tweet more often (two to five times) compared with other users and are tens of times more likely than others to highlight favorable, but not often scientifically corroborated, aspects of e-cig use. Based on this feasibility study we believe automated surveillance of e-cigs on Twitter is an important research direction that has tremendous application potential especially in the immediate future in the context of impending FDA initiated regulations.

We identify several new research directions that can advance automated surveillance of e-cigs. Most of these tasks involve human annotation of user profiles and tweets to generate training data.
1) In this effort we focused on identifying proponents using a broad definition.
However, an important future research direction is to identify fine-grained classes, such as sales/marketing profiles, individual e-cig advocates who are not affiliated with any companies, regular e-cig users (even if they don’t explicitly advocate e-cigs), and pro-regulation representatives.
2) Given that gender, age group, race, and ethnicity can be predicted with reasonable accuracy [21]–[23], an important immediate future research direction is to use these methods to classify e-cig tweeters into these demographic categories and study e-cig themes in tweets by specific subpopulations. For example, given that teenagers, and especially African American teens, are an active group on Twitter [17], studying this specific subpopulation with regard to popular e-cig topics may yield crucial insights into their usage patterns and perceptions.
3) Polytobacco use is the practice of simultaneously using multiple forms of tobacco including regular cigarettes, e-cigs, hookah, and snus, which can lead to dangerous nicotine dependence. Another important question is to understand the prevalence of polytobacco use by spotting tweets that discuss such usage and identifying other forms being used along with e-cigs. Additionally, usage of addiction-forming substances such as alcohol, illicit drugs, and prescription drugs along with e-cigs can also be studied by a more refined analysis of tweet content.
4) Another important direction is to identify “popular” tweets and factors contributing to the popularity of different types of e-cig related tweets, where popularity is assessed in terms of retweets, replies, and favorites. For example, what tweet characteristics (such as presence of images, URLs, hashtags, numbers of followers) drive the popularity of e-cig sales/marketing tweets vs. pro-regulation tweets?
For tweets that gathered significant retweet/favorite/reply activity from teenagers, a related task is to identify factors for such popularity, including the proportion of their friends who contributed to such activity before them and their gender and race.

We believe this will not only aid in surveillance, but also in developing strategies to maximize the diffusion of results of scientific research and recommendations from FDA to a broader audience on Twitter, which will be critical to raise awareness.

ACKNOWLEDGEMENTS

Many thanks to Ellen Hahn of the College of Nursing at UKY for general discussions on e-cig themes. This research was supported by the National Center for Research Resources and the National Center for Advancing Translational Sciences, US National Institutes of Health (NIH), through Grant UL1TR000117 and the Kentucky Lung Cancer Research Program through Grant PO2-415-1400004000-1. The content of this paper is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.

REFERENCES

[1] A. K. Regan, G. Promoff, S. R. Dube, and R. Arrazola, “Electronic nicotine delivery systems: adult use and awareness of the e-cigarette in the USA,” Tobacco Control, vol. 22, no. 1, pp. 19–23, 2013.
[2] J.-F. Etter, C. Bullen, A. D. Flouris, M. Laugesen, and T. Eissenberg, “Electronic nicotine delivery systems: a research agenda,” Tobacco Control, vol. 20, no. 3, pp. 243–248, 2011.
[3] C. Bullen, H. McRobbie, S. Thornley, M. Glover, R. Lin, and M. Laugesen, “Effect of an electronic nicotine delivery device (e cigarette) on desire to smoke and withdrawal, user preferences and nicotine delivery: randomised cross-over trial,” Tobacco Control, vol. 19, no. 2, pp. 98–103, 2010.
[4] R. A. Grana and P. M. Ling, “Smoking revolution: A content analysis of electronic cigarette retail websites,” American Journal of Preventive Medicine, vol. 46, no. 4, pp. 395–403, 2014.
[5] J. Brown, E. Beard, D. Kotz, S. Michie, and R. West, “Real-world effectiveness of e-cigarettes when used to aid smoking cessation: a cross-sectional population study,” Addiction, vol. 109, no. 9, pp. 1531–1540, 2014.
[6] C. Bullen, C. Howe, M. Laugesen, H. McRobbie, V. Parag, J. Williman, and N. Walker, “Electronic cigarettes for smoking cessation: a randomised controlled trial,” The Lancet, vol. 382, no. 9905, pp. 1629–1637, 2013.
[7] R. Grana, L. Popova, and P. Ling, “A longitudinal analysis of electronic cigarette use and smoking cessation,” JAMA Internal Medicine, vol. 174, no. 5, pp. 812–813, 2014.
[8] K. A. Vickerman, K. M. Carpenter, T. Altman, C. M. Nash, and S. M. Zbikowski, “Use of electronic cigarettes among state tobacco cessation quitline callers,” Nicotine and Tobacco Research, vol. 15, no. 10, pp. 1787–1791, 2013.
[9] A. C. King, L. J. Smith, P. J. McNamara, A. K. Matthews, and D. J. Fridberg, “Passive exposure to electronic cigarette (e-cigarette) use increases desire for combustible and e-cigarettes in young adult smokers,” Tobacco Control, Online first.
[10] K. E. Farsalinos and R. Polosa, “Safety evaluation and risk assessment of electronic cigarettes as tobacco cigarette substitutes: a systematic review,” Therapeutic Advances in Drug Safety, vol. 5, no. 2, pp. 67–86, 2014.
[11] A. Slomski, “Report shows e-cigarette marketing aimed at youth,” JAMA, vol. 311, no. 22, p. 2264, 2014.
[12] M. McCarthy, “Youth exposure to e-cigarette advertising on US television soars,” BMJ: British Medical Journal, vol. 348, 2014.
[13] M.-C. Tremblay, P. Pluye, G. Gore, V. Granikov, K. B. Filion, and M. J. Eisenberg, “Regulation profiles of e-cigarettes in the United States: a critical review with qualitative synthesis,” BMC Medicine, vol. 13, no. 1, p. 130, 2015.
[14] Centers for Disease Control and Prevention, “Notes from the field: Electronic cigarette use among middle and high school students – United States, 2011-2012,” Morbidity and Mortality Weekly Report, vol. Sept, 2013.
[15] Centers for Disease Control. E-cigarette use triples among middle and high school students in just one year. http://www.cdc.gov/media/releases/2015/p0416-e-cigarette-use.html.
[16] A. S. Tan and C. A. Bigman, “E-cigarette awareness and perceived harmfulness: Prevalence and associations with smoking-cessation outcomes,” American Journal of Preventive Medicine, Online first.
[17] Pew Research Internet Project. Part 1: Teens and social media use. http://www.pewinternet.org/2013/05/21/part-1-teens-and-social-media-use/.
[18] Alexa, Inc. (2014) Alexa top 500 global sites. http://www.alexa.com/topsites.
[19] Twitter, Inc. (2013) Registration with United States securities and exchanges commission. http://www.sec.gov/Archives/edgar/data/1418091/000119312513390321/d564001ds1.htm.
[20] Y. Liu, C. Kliman-Silver, and A. Mislove, “The tweets they are a-changin’: Evolution of twitter users and behavior,” in Proceedings of the Eighth AAAI International Conference on Weblogs and Social Media (ICWSM), 2014.
[21] D. Nguyen, R. Gravel, D. Trieschnigg, and T. Meder, ““how old do you think i am?” a study of language and age in twitter,” in Proceedings of the Seventh International AAAI Conference on Weblogs and Social Media (ICWSM), 2013, pp. 439–448.
[22] W. Liu and D. Ruths, “What’s in a name? using first names as features for gender inference in twitter,” in Proceedings of the AAAI Spring Symposium: Analyzing Microtext, 2013, pp. 10–16.
[23] A. Culotta, N. R. Kumar, and J. Cutler, “Predicting the demographics of twitter users from website traffic data,” in Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015, pp. 72–78.
[24] P. Velardi, G. Stilo, A. E. Tozzi, and F. Gesualdo, “Twitter mining for fine-grained syndromic surveillance,” Artificial Intelligence in Medicine, vol. 61, no. 3, pp. 153–163, 2014.
[25] M. J. Paul and M. Dredze, “You are what you tweet: Analyzing twitter for public health,” in Proceedings of the Fifth AAAI International Conference on Weblogs and Social Media (ICWSM), 2011, pp. 265–272.
[26] M. Dredze, “How social media will change public health,” Intelligent Systems, IEEE, vol. 27, no. 4, pp. 81–84, 2012.
[27] E. Aramaki, S. Maskawa, and M. Morita, “Twitter catches the flu: Detecting influenza epidemics using twitter,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, ser. EMNLP ’11. Association for Computational Linguistics, 2011, pp. 1568–1576.
[28] H. Park, S. Rodgers, and J. Stemmle, “Analyzing health organizations’ use of twitter for promoting health literacy,” Journal of Health Communication, vol. 18, no. 4, pp. 410–425, 2013.
[29] R. Teodoro and M. Naaman, “Fitter with twitter: Understanding personal health and fitness activity in social media,” in Proceedings of the Seventh AAAI International Conference on Weblogs and Social Media (ICWSM), 2013, pp. 611–620.
[30] C. C. Freifeld, J. S. Brownstein, C. M. Menone, W. Bao, R. Filice, T. Kass-Hout, and N. Dasgupta, “Digital drug safety surveillance: Monitoring pharmaceutical products in twitter,” Drug Safety, vol. 37, no. 5, pp. 343–350, 2014.
[31] M. Hefler, B. Freeman, and S. Chapman, “Tobacco control advocacy in the age of social media: using facebook, twitter and change,” Tobacco Control, vol. 22, no. 3, pp. 210–214, 2013.
[32] M. Myslín, S.-H. Zhu, W. Chapman, and M. Conway, “Using twitter to examine smoking behavior and perceptions of emerging tobacco products,” Journal of Medical Internet Research, vol. 15, no. 8, 2013.
[33] J. Huang, R. Kornfield, G. Szczypka, and S. L. Emery, “A cross-sectional examination of marketing of electronic cigarettes on twitter,” Tobacco Control, vol. 23, no. suppl 3, pp. iii26–iii30, 2014.
[34] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[35] S. Baccianella, A. Esuli, and F. Sebastiani, “SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining,” in LREC, vol. 10, 2010, pp. 2200–2204.
[36] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
[37] A. K. McCallum, “MALLET: A machine learning for language toolkit,” 2002, http://mallet.cs.umass.edu.