Toward Automated E-cigarette Surveillance:
Spotting E-cigarette Proponents on Twitter
Ramakanth Kavuluru∗† and AKM Sabbir†
∗ Division of Biomedical Informatics, Dept. of Biostatistics, University of Kentucky, Lexington, KY
† Department of Computer Science, University of Kentucky, Lexington, KY
{ramakanth.kavuluru, akm.sabbir}@uky.edu
Abstract—Electronic cigarettes (e-cigarettes or e-cigs) are a
popular emerging tobacco product. Because e-cigs do not generate toxic tobacco combustion products that result from smoking
regular cigarettes, they are sometimes perceived and promoted as
a less harmful alternative to smoking and also as means to quit
smoking. However, the safety of e-cigs and their efficacy in supporting smoking cessation are yet to be determined. Importantly,
the Food and Drug Administration (FDA) currently does not regulate
e-cigs and as such their manufacturing, marketing, and sale are not
subject to the rules that apply to traditional cigarettes. A number
of manufacturers, advocates, and e-cig users are promoting e-cigs
on Twitter and according to a recent estimate obtained through
Twitter Inc., there are about 31,000 e-cig related tweets per day.
In this paper we develop a high accuracy supervised predictive
model (precision 97%, recall 86%) to automatically identify e-cig
“proponents” on Twitter. Analyzing their corresponding tweets
from a large corpus, we find that as opposed to regular tweeters
that form over 90% of the dataset, e-cig proponents are a
smaller subset but tweet two to five times more than regular
tweeters and also disproportionately highlight e-cig flavors, their
smoke-free and potential harm reduction aspects, and their
claimed use in smoking cessation. Given FDA is currently in
the process of proposing meaningful regulation, we believe our
work demonstrates the strong potential of machine learning
approaches for automated e-cig surveillance on Twitter.
I. INTRODUCTION
Electronic cigarettes (e-cigarettes or simply e-cigs) were
introduced in the United States (US) in 2007 [1] and are
currently a popular emerging tobacco product across the
world. An e-cig essentially consists of a battery that heats
up liquid nicotine available in a cartridge into a vapor that is
inhaled by the user [2]. E-cig users are termed vapers and the
process of using an e-cig is called vaping. E-cigs are similar
to conventional tobacco cigarettes with regards to visual,
sensory, and behavioral aspects and hence were observed to
reduce craving [3]. Owing to their recent introduction, there
are very few studies on e-cig safety, risk of abuse, and their
efficacy as a smoking cessation aid especially about long term
use effects. In fact, currently the search phrase electronic
nicotine delivery systems OR e-cigarette
OR electronic cigarette yields 553 articles in the
PubMed search system out of which 475 (or 86%) had
dates of publication in 2014 or 2015. Because e-cigs do not
generate toxic combustion products that are produced with
tobacco cigarettes, they are perceived and also sometimes
marketed as suitable alternatives for smoking cessation [4].
However, scientific research to verify these claims is limited
and is often inconclusive. On one hand there are studies that
indicate comparable or superior effectiveness of e-cigs in
smoking cessation [5], [6]. However, there are also results [7],
[8] that show no such associations exist between e-cig use
and quitting or reduced conventional cigarette consumption.
Another recent effort [9] also indicates that passive exposure
to e-cigs increases the desire to smoke both regular cigarettes
and e-cigs. Nevertheless, current research seems to indicate
that they are less harmful than traditional cigarettes [10].
The ongoing healthy scientific debate around e-cigs is
welcome, especially for regular smokers who
are interested in quitting or adopting less harmful alternatives.
However, lack of FDA regulation (except for therapeutic use)
has heavily increased marketing of e-cigs on the Web [11]
and through television ads [12] even if individual states have
recently started to enact their own regulations to limit sales,
marketing, and use [13]. According to a 2013 Centers for
Disease Control and Prevention (CDC) report [14], e-cig
consumption doubled in middle and high school students from
2011 to 2012. Furthermore, 9.3% of middle and high school
ever e-cig users in 2012 have never smoked conventional
cigarettes. Alarmingly, this percentage goes up to 20.3% when
considering only middle school students. A more recent CDC
report [15] shows that e-cig use tripled from 2013 to 2014
among middle and high school students. Since long term
safety of e-cigs has not been thoroughly studied yet, the
prospect of adolescents developing nicotine dependence could
be detrimental to public health in future generations. When
considering adult smokers, however, the significant increase in
e-cig awareness has reduced their perception of e-cigs as being
less harmful compared with regular cigarettes [16]. Since Web
based advertising and discussion still plays a major role in
e-cig marketing and use and given one in four online US
teenagers uses Twitter [17], we believe it is critical to study
the landscape of e-cig messages and their authors on Twitter.
Although e-cig message themes and author classification
might be highly granular, in this pilot project we take a
simpler approach to tweet author classification – each tweeter
is either a “proponent” or not for our purposes. Proponents
are tweeters who represent e-cig sales or marketing agencies,
individuals who advocate e-cigs, or tweeters who specifically
identify themselves as vapers in their profile bio. Essentially
these tweeters are generally more inclined to support e-cigs
regardless of their specific motivation (e.g., business, lobbying,
smoking cessation). In this paper, based on a hand-labeled
dataset of 1000 tweeter profiles, we build machine learned
models to automatically identify proponents. We subsequently
use this model to analyze the content of tweets generated
by proponents in comparison with other tweeters along several well known e-cig themes (e-cig flavors, harm reduction,
smoke-free aspect, and smoking cessation) using straightforward text processing. We demonstrate that proponents are
many times as likely to highlight the attractive (and sometimes
scientifically not yet verified) aspects of e-cigs compared with
regular tweeters. To our knowledge this is the first attempt in
identifying proponents and a first step in building a framework
for automatic surveillance of e-cig related chatter.
II. BACKGROUND AND RELATED WORK
Since its introduction in 2006, Twitter has grown into one
of the top 15 visited websites [18] in the world with 100
million daily active users who generate over 500 million tweets
per day [19]. The asymmetric network structure of Twitter
inherently supports information diffusion and given that a
recent study [20] reveals that over 95% of Twitter profiles
are public, mining tweets is a practical tool to measure user
engagement with various events and products. Since users are
not required to publicly declare personal information, several
recent studies have focused on identifying user demographic
attributes such as age groups and life stages [21], gender [22],
and race and ethnicity [23].
In the context of public health, Twitter based automatic syndromic surveillance has been shown to have high correlation
with traditional surveillance methods [24], [25] with the added
advantage of near real time access to trends, especially in
the early epidemic stages [26], [27]. Recent efforts also noted
Twitter’s suitability for promoting health literacy [28], encouraging fitness activity [29], and monitoring drug safety [30].
In the context of tobacco control advocacy, researchers found
significant reach through Twitter in obtaining signatures for an
online petition to drop tobacco sponsorship for an international
music concert in Indonesia [31]. Another recent Twitter based
study focused on emerging tobacco products [32] found high
prevalence of positive sentiment for hookah and e-cig. It also
successfully demonstrated the application of machine learning
methods in automatically identifying tobacco related tweets,
constituent themes, and sentiments.
The most relevant effort in the context of our paper is from
Huang et al. [33] who automatically identify “commercial”
tweets from a corpus of nearly 73,000 tweets collected in
the months of May and June in 2012. For their purposes
tweets that contain links to sales websites and promotional
messages are all commercial regardless of who posts them
(regular tweeters vs e-cig marketers/advocates). They use DiscoverText, a cloud based commercial text analytics software
program, to semi-automatically (Naive Bayes with additional
heuristics) classify tweets as commercial or not. They report
that 90% of their tweets are commercial and nearly 10%
of such tweets mention smoking cessation. In our effort,
instead of identifying commercial tweets, we identify e-cig
proponent Twitter profiles using supervised machine learning.
We believe this is a more direct approach that aids in electronic
surveillance efforts needed in the immediate future to monitor
e-cig marketing/publicity practices and as such complements
Huang et al.’s effort. Furthermore, compared with 2012, there
is an order of magnitude increase in the number of e-cig
tweets (based on an official quote from Twitter Inc.) and our
experiments are conducted on a corpus of over one million
e-cig related tweets. After identifying proponent profiles, we
conducted analyses along well known e-cig themes to see
differences between tweeting behaviors of proponents and
regular tweeters.
III. DATASETS AND ANNOTATION
We used two different datasets of e-cig related tweets:
the first set of 224,000 tweets was obtained using rate
limited Twitter streaming API during the months of
September to December 2013 and a second dataset of
nearly one million tweets (purchased from the exhaustive
Twitter firehose) for the month of March 2015 that
match the query terms: electronic-cigarette,
e-cig, e-cigarette, e-juice, e-liquid,
vape-juice, and vape-liquid. Variants of these terms
with spaces instead of hyphens or just without the hyphens
(for matching hashtags) were also used in the query. These
terms were chosen in consultation with faculty members in
the College of Nursing at the University of Kentucky (UKY)
who work on e-cig policy research. They are specific enough
and empirically shown to result in a 99% match to actual
e-cig related tweets [33]. The juice/liquid terms represent
the liquid nicotine cartridges that need to be refilled for the
vaping devices. The smaller dataset represents a free sample
curated during a four month period and the second larger
dataset constitutes a full dataset from one month. Thus, we
get a recent estimate of around 31,000 e-cig tweets per day
(as of March 2015) compared with about 1200 such tweets
per day in mid 2012 [33], indicating a 25 fold increase.
Our supervised prediction of proponents on Twitter relies
on each tweeter’s Twitter username, profile bio/description,
and most recent tweets. The username or handle is unique
to each tweeter and can be up to 15 characters in length.
Optionally, each tweeter can also choose to write a profile
bio of at most 160 characters to characterize his/her persona
on Twitter. If a tweeter never tweets about e-cigs (matches
none of our query terms) during a period of surveillance, we
automatically assume that they are not a proponent. Thus only
those tweeters who have authored at least one e-cig tweet
are considered as candidates for classification. Due to reasons
that we elaborate later, we classify profiles with empty bios
differently compared with those that have non-empty profile
descriptions. Based on a sample of 34,000 unique users with at
least one e-cig tweet during our four month sample collection,
we determine that approximately 20% of e-cig tweeters have
empty profile descriptions where the tweeters choose not to
provide a bio. Our supervised approach is designed for the
80% of e-cig tweeters with non-empty bios. We use a simple
unsupervised approach for tweeters with empty bios. Next, we
outline the training dataset creation for classifying tweeters
with non-empty bios.
Feature description | Presence in best combination
Unigrams and bigrams from bio and recent tweets | ✓
Part of speech tags of bio and recent tweets | ✓
Average positive and negative polarity scores of the bio and recent tweets | ✓
Topic distribution of bio and recent tweets: based on 10 topics for bios and 20 topics for recent tweets generated using LDA based on training dataset profiles | ✓
Presence of user mentions, URLs, and punctuation marks in the bio | ✗
Length of the bio and average length of recent tweets; also binarized versions of these features: whether the lengths are above the averages determined on training data | ✗
Presence of the following terms as substrings in user name: vape, vapor, vapour, vaping, ecig, eliquid, ejuice (e.g., @ecighunter, @askavaper, @vapeclub) | ✓
TABLE I: Feature groups explored for predicting e-cig proponents
We randomly chose 1000 tweeter profiles with non-empty
bios from our corpus of e-cig tweets. Two annotators independently annotated each of those profiles with a positive
(proponent) label if the bio indicates that they are in e-cig
sales/marketing or if they are advocates or vapers. This seemed
reasonable given our manual overview of the dataset indicated
that bio text is often a strong indicator of tweeter perception
of e-cigs. For example, the following are a few real bios from
our dataset.
• “a vaper trying to help other find that sweet state of vape.”
• “I love e-cigs. If anyone wants to buy, I can hook you up.”
• “a Finnish vaping advocate who semi occasionally enjoys rambling on youtube”
• “Proud vaper and constitutionalist. I support vaping and the constitution as is was written.”
• “Dedicated to bringing the best deals on vaping needs.”
• “Manufacturer and distributor of American made premium vape Juice. We also sell e-cigarettes and supplies from around the world.”
If the bio does not give enough evidence to make a conclusive
decision, the annotators looked at the recent 200 tweets
generated from that profile to see if they are predominantly
favoring e-cigs. We note that tweeters who retweet e-cig
favoring tweets are also considered proponents even if they
are not original authors of those tweets. We reiterate that
for our purposes proponents are tweeters who are generally
more inclined to support e-cigs regardless of their specific
motivation (e.g., business, lobbying, smoking cessation). We
obtained substantial inter annotator agreement (κ = 0.88) and
after ignoring the 43 profiles where the annotators disagreed,
we have 957 profiles in the final dataset with 216 proponents
and 741 in the negative class.
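The agreement statistic can be illustrated with a minimal sketch, assuming κ here denotes Cohen's kappa for two annotators (the text does not name the variant). The labels below are hypothetical toy values, not our annotations; the function name `cohen_kappa` is ours.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items (assumed variant)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical toy labels (1 = proponent, 0 = not a proponent).
annotator_1 = [1, 1, 0, 0, 0, 1, 0, 0, 0, 0]
annotator_2 = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
kappa = cohen_kappa(annotator_1, annotator_2)

# Keep only the profiles on which both annotators agree, as done for the final dataset.
agreed = [a for a, b in zip(annotator_1, annotator_2) if a == b]
```

The sketch does not handle the degenerate case where chance agreement equals one (both annotators using a single label throughout).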
IV. PROPONENT CLASSIFICATION MODEL
Given the short size of the profile bio we just used the logistic regression (LR) classifier in the Python Scikit-Learn [34]
machine learning library. As indicated in Section III, we use
features extracted from the username, bio, and recent tweets
to build the classification model. We split our dataset into
70% training and 30% test splits with stratified sampling of
both classes. Using average four fold cross-validation F-scores
computed over 200 distinct shuffles on the training dataset,
we identified the best feature combination from all the feature
types we experimented with as shown in Table I.
Unigrams and bigrams are traditional n-gram features typically used in text classification. Besides this, we also use
parts of speech of different tokens in both bio and recent
tweets’ text. In addition to this we also used sentiment score
features from bio and recent tweets. For a given bio (or a set
of recent tweets taken as a single text blob), our features are
the average positive and negative scores of the text computed
over all sentiment words in it using the scores available in
SentiWordNet 3.0 [35]. We also included topic modeling based
features using bios and recent tweets. The central idea is
to apply latent Dirichlet allocation [36] modeling using the
MALLET [37] toolkit to the bio text from all profiles in
the training dataset and at the test time “fold in” a new bio
into the model to infer probabilities $P(t_i \mid b)$ of the topic $t_i$
given the new bio $b$. Based on experiments we determined
ten is the ideal number of topics for the bios and twenty
topics for modeling the recent tweets. This means that for
each bio, we will have as features $P(t_1 \mid b), \ldots, P(t_{10} \mid b)$ for
the new bio b. Similarly, for each set of recent tweets we have
twenty features based on topic modeling. We also used binary
features to incorporate presence of user mentions, URLs, and
certain punctuation marks (e.g., !, ?) in the bio. However, these
features did not end up in the best feature combination. Similarly,
the length of the bio and the average length of the recent tweets
did not make it into the final model based on cross validation
experiments. Another important feature is the presence of
e-cig related keywords appearing as substrings of the tweeter
user name as shown in the final row of Table I.
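The user name feature from the final row of Table I can be sketched as follows. The term list is taken from Table I; lowercasing the handle and stripping a leading "@" are our assumptions, and the function name is illustrative.

```python
# Terms from the final row of Table I.
E_CIG_TERMS = ("vape", "vapor", "vapour", "vaping", "ecig", "eliquid", "ejuice")

def username_has_ecig_term(username):
    """Binary feature: does any e-cig term occur as a substring of the handle?"""
    # Assumption: matching is case-insensitive and ignores a leading '@'.
    name = username.lstrip("@").lower()
    return any(term in name for term in E_CIG_TERMS)

# Example handles from Table I, plus one hypothetical non-matching handle.
handles = ["@ecighunter", "@askavaper", "@vapeclub", "@coffee_fan"]
flags = [username_has_ecig_term(h) for h in handles]
```

Because this is a plain substring test, "@askavaper" matches through "vape" even though "vaper" is not itself in the list.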
A key finding in our initial experiments was that simply using recent tweets based and profile bio based features together
in the same model is not very helpful since crucial predictive
signal in the bio text (being short, about 160 characters) was
getting drowned out by features from the tweeter’s recent
tweets even when we considered only 10 or 20 recent tweets
in our experiments. To counter this we trained two separate
LR models: one based on just the bio and username based
features (say, $M^b$) and the second one based only on recent
tweets’ features (say, $M^r$). The final prediction model $M$ is
the weighted average of positive class probability estimates
output by both models. That is,
\[ P_M = \alpha P_{M^b} + (1 - \alpha) P_{M^r}, \]
where α ∈ [0, 1]. We determined that α = 0.85 and the
recent 20 tweets to be the best configuration for this weighted
averaging approach by maximizing cross validation F-scores
on the training data. Specifically, to find optimum α we did a
simple grid search with 0.1 increments starting with α = 0.1
and noticed a maximum F-score at a value of 0.8 for α with
dips on either side at 0.7 and 0.9. Then we conducted another
grid search experiment with 0.01 increments in the range
0.7 < α < 0.9 and found a new maximum F-score value
at α = 0.85. Thus we use α = 0.85 in all our experiments in
the rest of the paper. Intuitively, this means that profile bio is
nearly six times more important than recent tweets in spotting
proponents, which is not surprising given our single model
with both feature types resulted in very poor performance.
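The weighted averaging and the coarse-then-fine search for α can be sketched as below. This is a minimal illustration under stated assumptions: the probabilities are hypothetical toy values, the 0.5 decision threshold and the coarse grid ending at 1.0 are our choices, and the score is computed on a single toy set rather than by cross validation as in our experiments.

```python
def combine(p_bio, p_recent, alpha):
    """P_M = alpha * P_{M^b} + (1 - alpha) * P_{M^r}."""
    return alpha * p_bio + (1 - alpha) * p_recent

def f_score(y_true, y_pred):
    """F-score of the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def score_alpha(alpha, p_bio, p_recent, y_true, threshold=0.5):
    # Assumption: a profile is labeled proponent when P_M >= threshold.
    y_pred = [int(combine(b, r, alpha) >= threshold) for b, r in zip(p_bio, p_recent)]
    return f_score(y_true, y_pred)

def two_stage_search(p_bio, p_recent, y_true):
    """Coarse grid in 0.1 steps, then 0.01 steps strictly around the coarse optimum."""
    coarse = [k / 10 for k in range(1, 11)]
    a0 = max(coarse, key=lambda a: score_alpha(a, p_bio, p_recent, y_true))
    fine = [round(a0 - 0.1 + k / 100, 2) for k in range(1, 20)]
    fine = [a for a in fine if 0.0 <= a <= 1.0]
    return max(fine, key=lambda a: score_alpha(a, p_bio, p_recent, y_true))

# Hypothetical probabilities from the bio model and the recent tweets model.
p_bio = [0.9, 0.2, 0.6, 0.4]
p_recent = [0.3, 0.8, 0.2, 0.95]
y_true = [1, 0, 1, 0]
alpha = two_stage_search(p_bio, p_recent, y_true)

# Relative weight of the bio model at the reported setting: 0.85 / 0.15.
bio_to_recent_weight = 0.85 / (1 - 0.85)
```

The last line makes the "nearly six times" remark concrete: at α = 0.85 the bio model's weight is about 5.7 times that of the recent tweets model.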
Using the best feature combination as represented in Table I
and the weighted model averaging approach with α = 0.85,
we trained a model on the training data and tested on the 30% test
set to obtain an F-score of 0.896 (with precision 0.96 and recall
0.84). Since this is the score on a single run, we considered
500 distinct shuffles of our full dataset. For each shuffle, we
used stratified sampling (maintaining class proportions) to split
it into 80%-20% train-test sets and using the best combination,
trained on the 80% set and tested on the 20% set. Using this
approach and descriptive statistics, we obtain a mean F-score
0.9152 (0.9147 – 0.9157), mean precision 0.9721 (0.9716 –
0.9726), and mean recall 0.8663 (0.8656 – 0.8671) with 95%
confidence intervals shown in parentheses.
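One common way to summarize scores over repeated shuffles is a mean with a normal-approximation confidence interval. The sketch below assumes that construction (the exact interval method is not specified above) and uses hypothetical scores rather than our 500 measured values.

```python
import math
import statistics

def mean_with_ci(scores, z=1.96):
    """Mean and normal-approximation 95% confidence interval (assumed method)."""
    mean = statistics.mean(scores)
    # Standard error of the mean from the sample standard deviation.
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, mean - half_width, mean + half_width

# Hypothetical per-shuffle F-scores, for illustration only.
f_scores = [0.90, 0.92, 0.91, 0.93, 0.89]
mean_f, low_f, high_f = mean_with_ci(f_scores)
```

With many shuffles the interval narrows roughly with the square root of the number of runs, which is why 500 shuffles give tight bounds.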
Using 500 distinct shuffles of the dataset, we ran feature
ablation experiments where we removed one feature at a time
to measure the performance drop incurred, which indicates
the contribution of that feature to the overall model. From
ablation results shown in Table II we see that dropping recent
tweet features causes biggest drop in recall and dropping the
username substring feature (last row of Table I) causes the
biggest drop in precision and F-score. POS tag ablation causes
a small drop in performance compared with polarity score
and topic distribution score removal. This is not surprising
because the tweets and bios of proponents seemed to largely
revolve around e-cig themes while regular tweeters discuss
varied topics and are not focused on e-cigs. For all features,
the drop in recall is always larger than the drop in precision.
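The observations above can be checked directly from the Table II values. The following sketch only re-derives the drops from the reported averages; the variable names are ours.

```python
# (precision, recall, F-score) averages as reported in Table II.
FULL = (0.9721, 0.8663, 0.9152)
ABLATED = {
    "POS tags": (0.9596, 0.8458, 0.8979),
    "Polarity scores": (0.9361, 0.7838, 0.8514),
    "Topic scores": (0.9342, 0.7869, 0.8524),
    "Username": (0.8900, 0.7802, 0.8295),
    "Recent tweets": (0.9229, 0.7733, 0.8394),
}

# Drop in each metric when one feature group is removed from the full model.
drops = {
    name: tuple(round(full - ablated, 4) for full, ablated in zip(FULL, scores))
    for name, scores in ABLATED.items()
}

largest_precision_drop = max(drops, key=lambda name: drops[name][0])
largest_recall_drop = max(drops, key=lambda name: drops[name][1])
largest_f_drop = max(drops, key=lambda name: drops[name][2])
recall_drop_always_larger = all(d[1] > d[0] for d in drops.values())
```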
Model | Precision | Recall | F-score
Full model | 0.9721 | 0.8663 | 0.9152
– POS tags | 0.9596 | 0.8458 | 0.8979
– Polarity scores | 0.9361 | 0.7838 | 0.8514
– Topic scores | 0.9342 | 0.7869 | 0.8524
– Username | 0.8900 | 0.7802 | 0.8295
– Recent tweets | 0.9229 | 0.7733 | 0.8394
TABLE II: Feature ablation results with averages computed
over 500 distinct shuffles of the dataset
Overall our efforts have resulted in a very high precision
model with reasonable recall for profiles with non-empty bios.
Specifically, the feature that incorporates e-cig related terms
as substrings of tweeter username turns out to be a powerful
predictor for spotting proponents. We use this specific feature
in an unsupervised fashion to identify proponents for tweeters
with empty bios (about a fifth of all e-cig tweeters). That is, for
profiles with empty bios we simply see if certain e-cig related
terms (last row of Table I) are in the username and if they
are present, we classify corresponding tweeters as proponents.
A manual examination of 500 profiles with user names that
have these terms as substrings reveals a 99% precision of this
approach in identifying proponents for tweeters with empty
bios. However, at this point, we don’t have a methodical way
to do a supervised approach for identifying proponents with
empty bios that do not fit this simple criterion. Still, most users
whose bios are empty and do not match this user name based
rule appear to be regular non-proponent tweeters based on a
manual examination of a sample of such profiles.
V. ANALYSIS OF PROPONENT TWEETS
In this section, we analyze the tweets generated by proponents and other tweeters along familiar e-cig themes. Before
we proceed, we introduce a slight modification to the way
we apply the model built in Section IV. We analyzed the
confusion matrix of the test set predictions of our model and
noticed that there were several false negatives which could
have been captured by applying the simple user name substring
match approach used for empty profiles (last row of Table I).
Given this particular substring based classification yields near
perfect precision (see Section IV), for the rest of the paper
we label profiles that match the user name substring rule as
proponents and apply our model only to non-empty profiles that
do not match it. However, we note that this
user name based identification approach is not comprehensive
because proponents do not always use such user names. In
our experiments with two datasets in the rest of the section,
our full model yields 25-35% (depending on the dataset used:
sample vs exhaustive) more proponents compared with the
simpler user name based identification.
Dataset | Tweeters: Total | Tweeters: Proponents | Tweets: Total | Tweets: Proponents
2013 Sep-Dec sample | 34,000 | 2540 (7.5%) | 224,000 | 32,682 (14.6%)
2015 March subset | 100,000 | 4359 (4.3%) | 349,401 | 72,384 (20.7%)
TABLE III: Proponent and corresponding tweet counts in both datasets
E-cig theme | 2013 Sep-Dec sample tweets: Total | By props (+RTs) | Rate ratio (+RTs) | 2015 March subset tweets: Total | By props (+RTs) | Rate ratio (+RTs)
Flavors | 4018 | 2258 (3175) | 15 (46) | 10,855 | 5207 (6400) | 20 (31)
Harm reduction | 374 | 193 (212) | 13 (16) | 1527 | 1118 (1227) | 60 (91)
Smoke-free aspect | 1902 | 1033 (1350) | 14 (30) | 11,220 | 7380 (7657) | 42 (47)
Smoking cessation | 5228 | 1923 (2991) | 7 (16) | 5863 | 3820 (4532) | 41 (75)
TABLE IV: Thematic distribution of e-cig tweets for proponents vs other tweeters
To analyze the tweets from proponents compared with the
rest of the users, we apply our model to the tweeters in
two different datasets introduced in Section III. The first
dataset is a free rate limited sample from a four month period
(September to December of 2013) and the second dataset is
an exhaustive dataset of nearly one million tweets for the
month of March 2015. This newer one month dataset has
nearly 360,000 tweeters. Given our model depends also on
recent tweets and Twitter imposes prohibitive rate limits for
collecting recent tweets in a timely fashion, we chose to look
at 100,000 randomly chosen tweeters and their tweets (nearly
350,000) in the new dataset. The old free API based dataset has
34,000 tweeters and 224,000 tweets in the sample generated
by them. Although we consider a subset of the tweeters in the
new dataset, we believe this captures a different perspective
given we are more likely to hit frequent tweeters with the rate
limited free sampling approach but are likely to incorporate
more infrequent tweeters with the selected subset from the
one month dataset. After applying our classification model to
these tweeters in both datasets, we obtain results as shown
in Table III. We notice that the percentage of proponents is
about 3 percentage points higher in the older sample compared to the subset of the
exhaustive one month dataset. This is not surprising given the
subset of the exhaustive sample is more likely to capture tweets
from infrequent tweeters who are occasionally tweeting about
e-cigs.
The final two columns of Table III show the total number
of tweets and corresponding sizes of the subsets generated
by the proponents. From the first row, for the rate limited
sample from 2013, we notice that on average a proponent
generates twice as many tweets as other tweeters. Based on
the second row we notice that in the 2015 sample, on average
proponents tweet more than five times as much as other tweeters. We
compute this by simply comparing the average number of
tweets by proponents (72384/4359 = 16.6) with those by
regular tweeters: (349401 − 72384)/(100000 − 4359) = 2.9.
We indicated in Section III that the number of tweets per day
on e-cigs has increased 25 times compared with the rate in
2012. Here we also observed that the tweets by proponents are
also increasing considerably compared with those by regular
tweeters.
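The per-tweeter averages quoted above follow directly from the counts in Table III. The short sketch below reproduces the arithmetic for both datasets; the function name is ours.

```python
def tweets_per_tweeter(total_tweeters, proponents, total_tweets, proponent_tweets):
    """Average tweets per proponent, per other tweeter, and their ratio."""
    per_proponent = proponent_tweets / proponents
    per_other = (total_tweets - proponent_tweets) / (total_tweeters - proponents)
    return per_proponent, per_other, per_proponent / per_other

# Counts from Table III.
prop_2013, other_2013, ratio_2013 = tweets_per_tweeter(34_000, 2_540, 224_000, 32_682)
prop_2015, other_2015, ratio_2015 = tweets_per_tweeter(100_000, 4_359, 349_401, 72_384)
```

For 2013 the ratio is a little above two, and for 2015 it is between five and six, consistent with the statements in the text.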
Next we focus on tweet content analysis based on four
popular e-cig themes shown in the first column of Table IV.
These themes were identified based on consultation with
researchers who work on tobacco policy at UKY. Before we
get into details of various themes, we briefly describe different
elements of Table IV. The ‘total’ column in the table indicates
the total number of tweets in the dataset for a specific theme
and the ‘by props’ column indicates the number of those
tweets arising from proponents identified through our methods
in Section IV. The count in the parentheses includes those
thematic tweets originally tweeted by the proponents but were
later retweeted by other tweeters. This is because retweets
are counted as separate tweets. So, although these should
be counted under the other tweeter group, since the original
tweeters are proponents, we can attribute these retweets to the
original proponent tweeters instead of the users who retweeted
it. As we can see, although proponents form a very small
percentage (third column of Table III) of the full set of
tweeters, they generate a significant proportion of tweets for
many of these popular e-cig themes. To compare the tweeting
behavior of proponents with that of others, we use the
measure
\[
\text{rate ratio} = \frac{\dfrac{(\#\text{ tweets by proponents})/(\#\text{ total tweets})}{\text{proportion of proponents}}}{\dfrac{(\#\text{ tweets by others})/(\#\text{ total tweets})}{\text{proportion of others}}},
\]
where all counts are for tweets belonging to a particular theme
in the context of Table IV and proportions of proponents/others
are over the full dataset (from the third column of Table III).
In fact, this formula offers a different but more intuitive way
to obtain the same result for all tweets as discussed in the
previous paragraph where we compared average tweeting rates
of proponents and others.
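As a worked illustration, the rate ratio can be computed from the counts in Tables III and IV. This sketch uses unrounded proportions, so its outputs can differ slightly from the integer values printed in Table IV; the function name is ours.

```python
def rate_ratio(by_props, total, n_props, n_tweeters):
    """Rate ratio of proponents vs. others for one set of tweets."""
    p_props = n_props / n_tweeters
    p_others = 1 - p_props
    rate_props = (by_props / total) / p_props
    rate_others = ((total - by_props) / total) / p_others
    return rate_props / rate_others

# 2015 March subset: 4359 proponents among 100,000 tweeters (Table III).
flavors_2015 = rate_ratio(5_207, 10_855, 4_359, 100_000)
harm_2015 = rate_ratio(1_118, 1_527, 4_359, 100_000)

# Applied to all tweets, the measure reduces to the ratio of average tweets
# per proponent to average tweets per other tweeter.
all_2015 = rate_ratio(72_384, 349_401, 4_359, 100_000)
per_tweeter_ratio = (72_384 / 4_359) / ((349_401 - 72_384) / (100_000 - 4_359))
```

The total tweet count cancels in the formula, which is why the two quantities on the last lines agree.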
One of the distinctive features of e-cigs compared with
traditional cigarettes is the multitude of flavors available
ranging from fruits to desserts. We went through the websites of
three popular e-cig companies (Blu, Njoy, and VaporFi
as identified in [33]) and curated a set of 22 popular flavors including menthol, strawberry, blueberry, cola, cherry, and mint
in the order of their frequency. We simply searched for these
flavor names in our e-cig tweets to obtain our counts shown
in Table IV. As we notice from the rate ratio, proponents are
15 times more likely to tweet about e-cig flavors based on the
2013 sample and are 20 times more likely to do that based on
recent data. Furthermore, there is a three fold increase in rate
ratio (based on the older sample) if we also include others’
flavor tweets, which are retweets of proponent tweets. FDA
might regulate the use of flavors as it had done for regular
cigarettes and at least as of now it appears that proponents
are heavily tweeting the flavor rich aspect of e-cigs. A caveat
here is that some of the tweets mentioning e-cigs might also
be discussing consuming the various berries as fruits instead
of using flavored e-cig. However, our manual examination of
several hundred flavor containing e-cig tweets revealed that
it is very rare and only happens with the chocolate flavor
given users often mentioned drinking hot chocolate while
vaping. Given this disambiguation issue, the chocolate flavor
was ignored in our analysis.
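A flavor search of this kind can be sketched as below. Only the six flavor names listed above are included, since the remaining names of the 22 are not reproduced here. Whole-word matching is our assumption; it also keeps "cola" from matching inside "chocolate".

```python
import re

# Six of the 22 curated flavor names; the rest are not listed in the text.
FLAVORS = ["menthol", "strawberry", "blueberry", "cola", "cherry", "mint"]

# Assumption: match flavor names as whole words, case-insensitively.
FLAVOR_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(f) for f in FLAVORS) + r")\b",
    re.IGNORECASE,
)

def mentions_flavor(tweet):
    """True if the tweet text contains any of the listed flavor names."""
    return FLAVOR_RE.search(tweet) is not None

# Hypothetical tweets, for illustration only.
tweets = [
    "New Strawberry e-juice in stock",
    "hot chocolate while vaping my e-cig",
    "Love my e-cig",
]
flavor_flags = [mentions_flavor(t) for t in tweets]
```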
Before we move ahead, we note that our thematic tweet
identification was based on filtering with Python regular
expressions that model lexical constraints. For the sake of
clarity, we present more intuitive lexical expressions in the
rest of this section. Furthermore, our focus was on precision
of identifying tweets belonging to a particular theme and hence
lexical expressions seemed more appropriate given the tweets
are already related to e-cigs.
Another popular theme in e-cig discussions is perceived
harm reduction compared with traditional cigarettes. Although
there might be merit in claims that e-cigs are not as harmful,
publicity of this nature might point youth and other nonsmokers to another gateway to form nicotine addiction. Hence
this may not be an appropriate way to promote e-cigs in
general. We identified e-cig tweets on this topic by searching
with the expressions: harm reduction, reduced harm, less
harmful, safe[r] than, safe[r] alternative, and healthy/healthier
alternative. Note that these searches are only limited to our
tweets that are already filtered using e-cig related keywords
as discussed in Section III. From Table IV, we see that in
the recent data sample, proponents are 60 times more likely
to discuss this aspect compared with other users. The ratio
increased by 60/13 = 4.6 times from late 2013 to early 2015
for this theme.
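The harm reduction expressions listed above can be written as one Python regular expression. This is a sketch of one possible encoding, not necessarily the exact patterns we ran; case-insensitive matching is an assumption.

```python
import re

# harm reduction, reduced harm, less harmful, safe[r] than,
# safe[r] alternative, healthy/healthier alternative
HARM_REDUCTION_RE = re.compile(
    r"harm reduction|reduced harm|less harmful"
    r"|safer? than|safer? alternative|health(?:y|ier) alternative",
    re.IGNORECASE,
)

def is_harm_reduction_tweet(tweet):
    """True if an (already e-cig filtered) tweet matches a harm reduction phrase."""
    return HARM_REDUCTION_RE.search(tweet) is not None

# Hypothetical tweets, for illustration only.
examples = [
    "E-cigs are a safer alternative to smoking",
    "Study says e-cigs are not less harmful",
    "New e-liquid flavors this week",
]
harm_flags = [is_harm_reduction_tweet(t) for t in examples]
```

The second example matches even though it negates the claim, since lexical matching of this kind does not capture polarity.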
A major hindrance to using traditional cigarettes is the
smoke they generate and the smoking bans in place due to that
reason. Second hand smoke related consequences might also
lead smokers to reduce their consumption especially in
public places and in the presence of family members. Hence e-cig proponents are more likely to highlight the smoke free aspect of vaping. We applied further filtering on our e-cig tweets
using the following expressions: 1. smoke[- ]free, 2. smoke[- ]less, and 3. tokens ‘tobacco’ or ‘smoking’ and the word
‘alternative’ in a tweet. We used optional hyphen or white
space for the first two expressions and the third expression
requires the occurrence of either tobacco or smoking in the
tweet along with the word alternative. From the third row of
Table IV we can see that the rate ratio has increased three
times from 2013 to 2015. It is also extremely high, at 42, as
of early 2015.
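The third smoke-free expression is a co-occurrence constraint rather than a phrase, so it is easier to express on tokens than as a single pattern. The sketch below is one possible encoding; the simple alphabetic tokenizer is our assumption.

```python
import re

# Expressions 1 and 2: optional hyphen or space between the two parts.
SMOKE_FREE_RE = re.compile(r"smoke[- ]?free|smoke[- ]?less", re.IGNORECASE)

def is_smoke_free_tweet(tweet):
    """True if the tweet matches any of the three smoke-free constraints."""
    text = tweet.lower()
    if SMOKE_FREE_RE.search(text):
        return True
    # Expression 3: 'alternative' together with 'tobacco' or 'smoking'.
    tokens = set(re.findall(r"[a-z]+", text))  # assumed tokenizer
    return "alternative" in tokens and bool(tokens & {"tobacco", "smoking"})

# Hypothetical tweets, for illustration only.
examples = [
    "Go smokefree with e-cigs",
    "Smoke-free vaping at home",
    "A tobacco alternative: the e-cig",
    "alternative flavors for your e-cig",
]
smoke_free_flags = [is_smoke_free_tweet(t) for t in examples]
```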
Our final theme is smoking cessation with the aid of e-cigs.
Given evidence is still being gathered and clinical studies are
being conducted to actually test these claims, it may not be
appropriate to publicize e-cigs as means to quit smoking. To
estimate the popularity of this theme, we filter our e-cig tweets
based on the following tweet text constraints:
• stop/stopped/stopping smoking
• smoking cessation
• give up, giving up, or given up smoking/tobacco
• quit/quitting and tobacco/smoking in the tweet
• kick/kicked/kicking his/her/their/my/your smoking/tobacco
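The constraints listed above can be sketched as a single filter function. This is a literal reading offered as an illustration, not the authors' code: four constraints are treated as contiguous phrases, the quit constraint is treated as co-occurrence anywhere in the tweet, and case-insensitive matching with word boundaries is an assumption. The function name is hypothetical.

```python
import re

# Contiguous phrase constraints (whitespace between words may vary).
CESSATION_PHRASES = re.compile(
    r"\b(?:"
    r"stop(?:ped|ping)?\s+smoking"
    r"|smoking\s+cessation"
    r"|giv(?:e|ing|en)\s+up\s+(?:smoking|tobacco)"
    r"|kick(?:ed|ing)?\s+(?:his|her|their|my|your)\s+(?:smoking|tobacco)"
    r")\b",
    re.IGNORECASE,
)

# Co-occurrence constraint: quit/quitting and tobacco/smoking in the tweet.
QUIT = re.compile(r"\bquit(?:ting)?\b", re.IGNORECASE)
TOBACCO_OR_SMOKING = re.compile(r"\b(?:tobacco|smoking)\b", re.IGNORECASE)


def matches_cessation(text):
    """Return True if the tweet satisfies any of the five constraints."""
    if CESSATION_PHRASES.search(text):
        return True
    return bool(QUIT.search(text) and TOBACCO_OR_SMOKING.search(text))
```

Because the patterns are read literally, a variant such as "gave up smoking" is not matched; whether such variants were covered in the study is not stated.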
For this theme, from the last row of Table IV, there is a
staggering sixfold (the highest among the four themes) increase in
rate ratio compared with the older dataset. Including retweets
of tweets by proponents, they are 75 times more likely to
discuss smoking cessation compared with regular tweeters.
However, we also notice that the absolute volume of cessation
related tweets has decreased even though the newer dataset has
125,000 more tweets than the 2013 sample. Looking at the
retweet-inclusive counts, we also notice that users are not
retweeting such tweets as much in the recent dataset, as
researchers and federal agencies have also started awareness
campaigns.
Although we have been careful in the paper to convey
that the thematic tweets simply match a few specific lexical
patterns, we would be remiss if we did not also discuss
an important shortcoming of this approach. In our filtering
we did not account for negated mentions or more generally
speaking the polarity of statements that discuss a given theme.
For example, our dataset has tweets matching our cessation
patterns that mention a research study where e-cigs were not
shown to aid in cessation. Similarly, tweets that say e-cigs
are not less harmful are also included in the harm reduction
theme. That is, in this work we identified tweets
that discuss a theme but not their polarity; determining polarity
is a crucial next step in our efforts on this topic. This limitation does not
necessarily take away from our analysis because proponents
rarely discuss negative aspects of e-cigs, especially with regard
to harm reduction and smoking cessation. However, it would
be interesting to understand the polarity of tweets by the others
group along these two themes.
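To make the shortcoming concrete, the sketch below flags theme matches that are preceded by a negation cue. It is a crude heuristic offered only as an illustration of the problem and was not part of this study: the cue list, the three-token window, and the small stand-in theme pattern are all assumptions, and a real polarity classifier would need annotated data.

```python
import re

# Hypothetical cue list; not exhaustive.
NEGATION_CUES = {
    "not", "no", "never", "isn't", "aren't", "don't", "doesn't",
    "cannot", "can't",
}

# Small stand-in pattern for the harm reduction theme (assumed).
THEME = re.compile(
    r"\b(?:less harmful|safer? than|safer? alternative)\b", re.IGNORECASE
)

TOKEN = re.compile(r"[a-z']+")


def has_negated_mention(text, theme=THEME, window=3):
    """True if a negation cue occurs within `window` tokens before a match."""
    # Normalize curly apostrophes so "isn’t" is tokenized like "isn't".
    normalized = text.replace("\u2019", "'").lower()
    for match in theme.finditer(normalized):
        preceding = TOKEN.findall(normalized[: match.start()])[-window:]
        if any(token in NEGATION_CUES for token in preceding):
            return True
    return False
```

Under this heuristic, "e-cigs are not less harmful than cigarettes" is flagged as a negated mention, whereas "e-cigs are less harmful than cigarettes" is not.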
VI. CONCLUDING REMARKS
E-cigs are a popular emerging tobacco product currently
not regulated by the FDA. As such, their sales and marketing
are not subject to the stricter rules typically applied to regular
cigarettes although individual states have recently enacted laws
to regulate them to some extent. In this paper, to aid automated
surveillance of e-cigs on social media, we conducted what
we believe is the first study to automatically identify e-cig
proponents on Twitter. Using a hand-labeled dataset, we built a
classification model with features based on tweeter bio, recent
tweet text, and user name. Our model achieves a precision
of 97% with recall of 86% and can be used to classify new
unseen profiles. We applied our model to two different datasets
with complementary characteristics collected in late 2013 and
March 2015. Our experiments showed that e-cig proponents
on Twitter constitute a very small percentage of the tweeters
who write about e-cigs. However, they tweet more often (two
to five times) compared with other users and are tens of
times more likely than others to highlight favorable, but not
often scientifically corroborated, aspects of e-cig use. Based
on this feasibility study we believe automated surveillance of
e-cigs on Twitter is an important research direction that has
tremendous application potential especially in the immediate
future in the context of impending FDA initiated regulations.
We identify several new research directions that can advance
automated surveillance of e-cigs. Most of these tasks involve
human annotation of user profiles and tweets to generate
training data.
1) In this effort we focused on identifying proponents using a broad definition. However, an important future research direction is to identify fine grained classes, such as
sales/marketing profiles, individual e-cig advocates who are
not affiliated with any companies, regular e-cig users (even
if they don’t explicitly advocate e-cigs), and pro-regulation
representatives.
2) Given gender, age group, race and ethnicity can be predicted with reasonable accuracy [21]–[23], an important
immediate future research direction is to use these methods
to classify e-cig tweeters into these demographic categories
and study e-cig themes in tweets by specific subpopulations.
For example, given teenagers, and especially African American teens, are an active group on Twitter [17], studying
this specific subpopulation with regard to popular e-cig
topics may yield crucial insights into their usage patterns
and perceptions.
3) Polytobacco is the practice of simultaneously using multiple forms of tobacco including regular cigarettes, e-cigs,
hookah, and snus, which can lead to dangerous nicotine
dependence. Another important question is to understand
the prevalence of polytobacco use by spotting tweets that discuss
such usage and identifying other forms being used along
with e-cigs. Additionally, usage of addiction forming substances such as alcohol, illicit drugs, and prescription drugs
along with e-cigs can also be studied by a more refined
analysis of tweet content.
4) Another important direction is to identify “popular” tweets
and factors contributing to the popularity of different types
of e-cig related tweets where popularity is assessed in terms
of retweets, replies, and favorites. For example, what tweet
characteristics (such as presence of images, URLs, hashtags, and number of followers) drive the popularity of e-cig
sales/marketing tweets vs. pro-regulation tweets? For tweets
that gathered significant retweet/favorite/reply activity from
teenagers, one could identify factors for such popularity, including the
proportion of their friends who contributed to such activity
before them and their gender and race. We believe this will not
only aid in surveillance, but also in developing strategies
to maximize the diffusion of results of scientific research
and recommendations from FDA to a broader audience on
Twitter, which will be critical to raise awareness.
ACKNOWLEDGEMENTS
Many thanks to Ellen Hahn of the College of Nursing at
UKY for general discussions on e-cig themes. This research
was supported by the National Center for Research Resources and the National Center for Advancing Translational
Sciences, US National Institutes of Health (NIH), through
Grant UL1TR000117 and the Kentucky Lung Cancer Research
Program through Grant PO2-415-1400004000-1. The content
of this paper is solely the responsibility of the authors and
does not necessarily represent the official views of the NIH.
REFERENCES
[1] A. K. Regan, G. Promoff, S. R. Dube, and R. Arrazola, “Electronic
nicotine delivery systems: adult use and awareness of the e-cigarette in
the USA,” Tobacco Control, vol. 22, no. 1, pp. 19–23, 2013.
[2] J.-F. Etter, C. Bullen, A. D. Flouris, M. Laugesen, and T. Eissenberg,
“Electronic nicotine delivery systems: a research agenda,” Tobacco
Control, vol. 20, no. 3, pp. 243–248, 2011.
[3] C. Bullen, H. McRobbie, S. Thornley, M. Glover, R. Lin, and M. Laugesen, “Effect of an electronic nicotine delivery device (e cigarette) on
desire to smoke and withdrawal, user preferences and nicotine delivery:
randomised cross-over trial,” Tobacco Control, vol. 19, no. 2, pp. 98–
103, 2010.
[4] R. A. Grana and P. M. Ling, “Smoking revolution: A content analysis
of electronic cigarette retail websites,” American journal of preventive
medicine, vol. 46, no. 4, pp. 395–403, 2014.
[5] J. Brown, E. Beard, D. Kotz, S. Michie, and R. West, “Real-world
effectiveness of e-cigarettes when used to aid smoking cessation: a cross-sectional population study,” Addiction, vol. 109, no. 9, pp. 1531–1540,
2014.
[6] C. Bullen, C. Howe, M. Laugesen, H. McRobbie, V. Parag, J. Williman, and N. Walker, “Electronic cigarettes for smoking cessation: a
randomised controlled trial,” The Lancet, vol. 382, no. 9905, pp. 1629–
1637, 2013.
[7] R. Grana, L. Popova, and P. Ling, “A longitudinal analysis of electronic
cigarette use and smoking cessation,” JAMA Internal Medicine, vol. 174,
no. 5, pp. 812–813, 2014.
[8] K. A. Vickerman, K. M. Carpenter, T. Altman, C. M. Nash, and S. M.
Zbikowski, “Use of electronic cigarettes among state tobacco cessation
quitline callers,” Nicotine and Tobacco Research, vol. 15, no. 10, pp.
1787–1791, 2013.
[9] A. C. King, L. J. Smith, P. J. McNamara, A. K. Matthews, and
D. J. Fridberg, “Passive exposure to electronic cigarette (e-cigarette)
use increases desire for combustible and e-cigarettes in young adult
smokers,” Tobacco control, Online first.
[10] K. E. Farsalinos and R. Polosa, “Safety evaluation and risk assessment
of electronic cigarettes as tobacco cigarette substitutes: a systematic
review,” Therapeutic advances in drug safety, vol. 5, no. 2, pp. 67–86,
2014.
[11] A. Slomski, “Report shows e-cigarette marketing aimed at youth,”
JAMA, vol. 311, no. 22, p. 2264, 2014.
[12] M. McCarthy, “Youth exposure to e-cigarette advertising on US television soars,” BMJ: British Medical Journal, vol. 348, 2014.
[13] M.-C. Tremblay, P. Pluye, G. Gore, V. Granikov, K. B. Filion, and M. J.
Eisenberg, “Regulation profiles of e-cigarettes in the united states: a
critical review with qualitative synthesis,” BMC medicine, vol. 13, no. 1,
p. 130, 2015.
[14] Centers for Disease Control and Prevention, “Notes from the field:
Electronic cigarette use among middle and high school students – united
states, 2011-2012,” Morbidity and Mortality Weekly Report, vol. Sept,
2013.
[15] Centers for Disease Control. E-cigarette use triples among middle and
high school students in just one year. http://www.cdc.gov/media/releases/
2015/p0416-e-cigarette-use.html.
[16] A. S. Tan and C. A. Bigman, “E-cigarette awareness and perceived harmfulness: Prevalence and associations with smoking-cessation outcomes,”
American journal of preventive medicine, Online first.
[17] Pew Research Internet Project. Part 1: Teens and social media use. http:
//www.pewinternet.org/2013/05/21/part-1-teens-and-social-media-use/.
[18] Alexa, Inc. (2014) Alexa top 500 global sites. http://www.alexa.com/
topsites.
[19] Twitter, Inc. (2013) Registration with United States securities and exchanges commission. http://www.sec.gov/Archives/edgar/data/1418091/
000119312513390321/d564001ds1.htm.
[20] Y. Liu, C. Kliman-Silver, and A. Mislove, “The tweets they are a-changin’: Evolution of twitter users and behavior,” in Proceedings of
the Eighth AAAI International Conference on Weblogs and Social Media
(ICWSM), 2014.
[21] D. Nguyen, R. Gravel, D. Trieschnigg, and T. Meder, ““how old do you
think i am?” a study of language and age in twitter.” in Proceedings
of the Seventh International AAAI Conference on Weblogs and Social
Media (ICWSM), 2013, pp. 439–448.
[22] W. Liu and D. Ruths, “What’s in a name? using first names as features
for gender inference in twitter.” in Proceedings of the AAAI Spring
Symposium: Analyzing Microtext, 2013, pp. 10–16.
[23] A. Culotta, N. R. Kumar, and J. Cutler, “Predicting the demographics
of twitter users from website traffic data,” in Twenty-Ninth AAAI
Conference on Artificial Intelligence, 2015, pp. 72–78.
[24] P. Velardi, G. Stilo, A. E. Tozzi, and F. Gesualdo, “Twitter mining for
fine-grained syndromic surveillance,” Artificial Intelligence in Medicine,
vol. 61, no. 3, pp. 153–163, 2014.
[25] M. J. Paul and M. Dredze, “You are what you tweet: Analyzing twitter
for public health.” in Proceedings of the Fifth AAAI International
Conference on Weblogs and Social Media (ICWSM), 2011, pp. 265–
272.
[26] M. Dredze, “How social media will change public health,” Intelligent
Systems, IEEE, vol. 27, no. 4, pp. 81–84, 2012.
[27] E. Aramaki, S. Maskawa, and M. Morita, “Twitter catches the flu:
Detecting influenza epidemics using twitter,” in Proceedings of the
Conference on Empirical Methods in Natural Language Processing, ser.
EMNLP ’11. Association for Computational Linguistics, 2011, pp.
1568–1576.
[28] H. Park, S. Rodgers, and J. Stemmle, “Analyzing health organizations’
use of twitter for promoting health literacy,” Journal of health communication, vol. 18, no. 4, pp. 410–425, 2013.
[29] R. Teodoro and M. Naaman, “Fitter with twitter: Understanding personal
health and fitness activity in social media,” in Proceedings of the Seventh
AAAI International Conference on Weblogs and Social Media (ICWSM),
2013, pp. 611–620.
[30] C. C. Freifeld, J. S. Brownstein, C. M. Menone, W. Bao, R. Filice,
T. Kass-Hout, and N. Dasgupta, “Digital drug safety surveillance:
Monitoring pharmaceutical products in twitter,” Drug Safety, vol. 37,
no. 5, pp. 343–350, 2014.
[31] M. Hefler, B. Freeman, and S. Chapman, “Tobacco control advocacy in
the age of social media: using facebook, twitter and change,” Tobacco
control, vol. 22, no. 3, pp. 210–214, 2013.
[32] M. Myslín, S.-H. Zhu, W. Chapman, and M. Conway, “Using twitter
to examine smoking behavior and perceptions of emerging tobacco
products,” Journal of medical Internet research, vol. 15, no. 8, 2013.
[33] J. Huang, R. Kornfield, G. Szczypka, and S. L. Emery, “A cross-sectional
examination of marketing of electronic cigarettes on twitter,” Tobacco
control, vol. 23, no. suppl 3, pp. iii26–iii30, 2014.
[34] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion,
O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine
Learning Research, vol. 12, pp. 2825–2830, 2011.
[35] S. Baccianella, A. Esuli, and F. Sebastiani, “SentiWordNet 3.0: An
enhanced lexical resource for sentiment analysis and opinion mining.”
in LREC, vol. 10, 2010, pp. 2200–2204.
[36] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,”
the Journal of machine Learning research, vol. 3, pp. 993–1022, 2003.
[37] A. K. McCallum, “MALLET: A machine learning for language toolkit,”
2002, http://mallet.cs.umass.edu.