This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TVCG.2015.2467554, IEEE Transactions on Visualization and Computer Graphics
An Uncertainty-Aware Approach for Exploratory Microblog Retrieval

Mengchen Liu, Shixia Liu, Xizhou Zhu, Qinying Liao, Furu Wei, and Shimei Pan

Fig. 1. Exploratory retrieval of the government shutdown dataset: (a) the hashtag graph with uncertainty and its propagation; (b) uncertainty propagation; (c)-(f) interactive ranking refinement results.

Abstract— Although there has been a great deal of interest in analyzing customer opinions and breaking news in microblogs, progress has been hampered by the lack of an effective mechanism to discover and retrieve data of interest from microblogs. To address this problem, we have developed an uncertainty-aware visual analytics approach to retrieve salient posts, users, and hashtags. We extend an existing ranking technique to compute a multifaceted retrieval result: the mutual reinforcement rank of a graph node, the uncertainty of each rank, and the propagation of uncertainty among different graph nodes. To illustrate the three facets, we have also designed a composite visualization with three visual components: a graph visualization, an uncertainty glyph, and a flow map. The graph visualization with glyphs, the flow map, and the uncertainty analysis together enable analysts to effectively find the most uncertain results and interactively refine them. We have applied our approach to several Twitter datasets. Qualitative evaluation and two real-world case studies demonstrate the promise of our approach for retrieving high-quality microblog data.

Index Terms—microblog data, mutual reinforcement model, uncertainty modeling, uncertainty visualization, uncertainty propagation.
M. Liu and S. Liu are with Tsinghua University. E-mail: simon900314@, shixia@. S. Liu is the corresponding author.
X. Zhu is with USTC. E-mail: ezra0408@.
Q. Liao and F. Wei are with Microsoft. E-mail: {qiliao,fuwei}@.
S. Pan is with University of Maryland, Baltimore County. E-mail: shimei@umbc.edu.
Manuscript received 31 Mar. 2015; accepted 1 Aug. 2015; date of publication xx Aug. 2015; date of current version 25 Oct. 2015. For information on obtaining reprints of this article, please send e-mail to: tvcg@.

1 INTRODUCTION

Microblogs such as Twitter and Facebook are among the most popular platforms for people to share their daily observations and thoughts, including personal status updates and opinions regarding products or government policies. Since the crowd in microblogs provides many individual comments/opinions that were not available before, businesses and organizations have begun to leverage microblogs to profile customers, derive brand perception, gauge citizen sentiments, and predict the stock market [23, 34, 41, 53]. For example, retailers track and examine relevant microblog posts to understand customer opinion toward their products and services.

In spite of the growing interest in quickly analyzing customer opinions or breaking news in microblogs, progress has been hampered by the lack of an effective mechanism to retrieve data of interest from microblogs. For this reason, researchers have developed a number of microblog retrieval methods [6, 17]. The main goal is to generate a list of k microblog posts that are relevant to the information needs represented by a query q. Although these methods have successfully retrieved
1077-2626 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See /publications_standards/publications/rights/index.html for more information.
relevant posts, they have two major drawbacks. First, the unique characteristics of microblog data are not comprehensively considered to improve retrieval performance. Posts, users, and hashtags are three key dimensions of microblog data. These dimensions are not independent, as they often influence one another. For example, a post published by an influential user and labeled with a popular hashtag tends to be salient. However, the existing approaches to microblog retrieval do not tightly integrate the three dimensions and do not take advantage of the relationships among them. These approaches either consider only the post or treat it as the primary dimension and the others as secondary dimensions to filter posts. For example, ScatterBlogs2 [6] finds posts of interest by checking whether the posts contain a certain hashtag. Second, existing methods do not address uncertainty in the retrieval models. Improving the modeling and presentation of uncertainty can help describe retrieved data more accurately, which can in turn assist analysts in making more informed decisions [14, 39, 49].
To address the above problems, we have developed an uncertainty-aware microblog retrieval toolkit, MutualRanker, to estimate uncertainty introduced by the analysis algorithm as well as to quickly retrieve salient posts, users, and hashtags. Since posts are shared and propagated on social networks, the authority of an author and the popularity of a hashtag play an important role in determining the importance of a post and vice versa. Accordingly, we formulate uncertainty-aware microblog retrieval using an uncertainty-based mutual reinforcement graph model (MRG) [16, 45], where the content quality of posts, the social influence of users, and the popularity of hashtags mutually reinforce one another. We adopt a Monte Carlo sampling method to solve MRG because of its effective local update mechanism, fast convergence, and probability-based uncertainty formalization [2]. A dispersion-based measure is used to estimate the uncertainty generated by the Monte Carlo sampling method. In addition, we model the uncertainty propagation as a Markov chain.

To help analysts understand the retrieved data, we have designed a composite visualization [20]. Specifically, a density-based graph visualization has been developed to visually illustrate posts, users, hashtags, and their relationships. An uncertainty glyph and a flow map are employed to represent uncertainty and its propagation on a graph (Fig. 1). The three visualization components, together with the uncertainty analysis, enable analysts to quickly detect the most uncertain results and interactively resolve them. The Monte Carlo sampling method is then used to incrementally modify the ranking results to meet user needs.
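The idea of pairing a Monte Carlo rank estimate with a dispersion-based uncertainty can be illustrated with a minimal sketch. This is not MutualRanker's implementation: it assumes a single toy graph rather than the three-layer MRG, estimates PageRank-style scores by counting the visits of terminating random walks, and uses the coefficient of variation across repeated runs as one possible dispersion measure. All function names, the graph, and the parameter values are hypothetical.

```python
import random
import statistics


def monte_carlo_ranks(graph, walks_per_node, damping, rng):
    """Estimate PageRank-style scores from visit counts of random walks.

    Each walk starts at a node, continues to a random out-neighbor with
    probability `damping`, and stops otherwise (or at a dangling node).
    """
    visits = {node: 0 for node in graph}
    for start in graph:
        for _ in range(walks_per_node):
            node = start
            while True:
                visits[node] += 1
                neighbors = graph[node]
                if not neighbors or rng.random() > damping:
                    break
                node = rng.choice(neighbors)
    total = sum(visits.values())
    return {node: count / total for node, count in visits.items()}


def ranks_with_uncertainty(graph, runs=30, walks_per_node=200,
                           damping=0.85, seed=0):
    """Return {node: (mean score, coefficient of variation)} over `runs` runs.

    The coefficient of variation (std / mean) is an assumed dispersion
    measure chosen for illustration; it is defined because scores are positive.
    """
    rng = random.Random(seed)
    samples = [monte_carlo_ranks(graph, walks_per_node, damping, rng)
               for _ in range(runs)]
    result = {}
    for node in graph:
        values = [sample[node] for sample in samples]
        mean = statistics.mean(values)
        spread = statistics.stdev(values)
        result[node] = (mean, spread / mean if mean > 0 else 0.0)
    return result


# Hypothetical hashtag co-occurrence graph (adjacency lists).
toy_graph = {
    "#shutdown": ["#obamacare", "#gop"],
    "#obamacare": ["#shutdown"],
    "#gop": ["#shutdown", "#teaparty"],
    "#teaparty": ["#gop"],
}
for tag, (score, cv) in ranks_with_uncertainty(toy_graph).items():
    print(f"{tag}: score={score:.3f}, uncertainty={cv:.3f}")
```

In such a sketch, nodes whose estimated score varies strongly between runs are the natural candidates to present to an analyst first.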
In summary, our work presents three technical contributions:

• An uncertainty-aware microblog retrieval model that extracts salient posts, users, and hashtags. This model also computes the associated uncertainty and its propagation among graph nodes.
• A composite visualization that enables users to understand the three-level, mutual reinforcement ranking results, the associated uncertainty, and uncertainty propagation patterns.
• A visual analytics system that helps users quickly retrieve data of interest, as well as analyze and understand the ranking results in an interactive and iterative process.

2 RELATED WORK
2.1 Microblog Retrieval

In the field of data mining, a number of approaches have been proposed to retrieve data from microblogs. A comprehensive survey was presented by Cherichi and Faiz [12]. Most recent work can be categorized into two groups: vector-space-based approaches and link analysis approaches.
The vector-space-based approach employs two feature vectors to represent a query and a post. A similarity measure (e.g., cosine similarity) is then adopted to estimate the similarity between the post and the query. There have been some recent research efforts that exploit additional structural features such as URLs and hashtags to enhance retrieval performance [1, 29, 31].
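As a minimal sketch of the vector-space idea, the snippet below represents the query and each post as bag-of-words term-frequency vectors and ranks posts by cosine similarity. The whitespace tokenizer, the raw term-frequency weighting, and the function names are assumptions made for illustration; the cited systems may use different features and weights.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector over lowercased whitespace tokens."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_posts(query, posts, k):
    """Return the k posts most similar to the query as (score, post) pairs."""
    q = term_vector(query)
    scored = [(cosine_similarity(q, term_vector(p)), p) for p in posts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical posts for a query q and k = 2.
example_posts = [
    "shutdown",
    "doctor who tonight",
    "government shutdown news",
]
for score, post in rank_posts("government shutdown", example_posts, k=2):
    print(f"{score:.3f}  {post}")
```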
Recently, to take advantage of the link structure of social networks, researchers have introduced the PageRank algorithm [7] in microblog retrieval. For example, TwitterRank [46] adopts the follower-followee link structure and the PageRank algorithm to identify influential users. Duan et al. [16] modeled the tweet-ranking problem as an MRG [45], where the social influence of users and the content quality of tweets mutually reinforce each other. Specifically, the post graph, the user graph, and the hashtag graph, as well as the relationships between the three graphs, were used to retrieve salient posts, users, and hashtags. We extend this approach by explicitly modeling the uncertainty of the ranking result, as well as its propagation on the tweet/user/hashtag graph.

In the field of visual analytics, a great deal of research has been conducted on visually analyzing microblog data. The methods applied include event detection [30], topic extraction and analysis [25, 40, 50], information diffusion [8, 52], sentiment analysis [47, 48], and revenue/stock prediction [28, 34]. However, few studies have focused on microblog retrieval.
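To make the mutual reinforcement idea behind the MRG of Duan et al. [16] more concrete, the following is a simplified two-layer sketch (users and posts only, without the hashtag layer or a post-post graph). It is an illustration under stated assumptions, not the model from [16] or from this paper: the update rule, the `damping` and `mix` parameters, and all names are hypothetical. A user's score is fed by followers and by the user's own posts, and a post's score is fed by its author.

```python
def normalize(scores):
    """Scale scores so that they sum to one."""
    total = sum(scores.values())
    return {key: value / total for key, value in scores.items()}


def mutual_reinforcement(authors, follows, damping=0.85, mix=0.5,
                         iterations=100):
    """Two-layer mutual reinforcement ranking (assumed, simplified form).

    authors: dict mapping post id -> authoring user
    follows: dict mapping follower -> list of followees
    Returns (user_scores, post_scores), each normalized to sum to one.
    """
    users = sorted(set(authors.values()) | set(follows)
                   | {u for followees in follows.values() for u in followees})
    posts = sorted(authors)
    posts_of = {u: [p for p in posts if authors[p] == u] for u in users}

    user = {u: 1.0 / len(users) for u in users}
    post = {p: 1.0 / len(posts) for p in posts}
    for _ in range(iterations):
        # Every user starts from a uniform baseline score.
        new_user = {u: (1 - damping) / len(users) for u in users}
        # Followers pass a share of their score to the users they follow.
        for follower, followees in follows.items():
            for followee in followees:
                new_user[followee] += (damping * mix * user[follower]
                                       / len(followees))
        # Posts reinforce their authors.
        for p in posts:
            new_user[authors[p]] += damping * (1 - mix) * post[p]
        # Authors reinforce their posts, splitting score across them.
        new_post = {
            p: (1 - damping) / len(posts)
            + damping * user[authors[p]] / len(posts_of[authors[p]])
            for p in posts
        }
        user, post = new_user, new_post
    return normalize(user), normalize(post)


# Hypothetical data: bob and carol follow alice; each user wrote one post.
example_authors = {"p_alice": "alice", "p_bob": "bob", "p_carol": "carol"}
example_follows = {"bob": ["alice"], "carol": ["alice"]}
user_scores, post_scores = mutual_reinforcement(example_authors, example_follows)
print(user_scores)
print(post_scores)
```

In this toy example the followed user ends up with the highest user score, and that score carries over to the post the user authored, which is the reinforcement effect the model relies on.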
Bosch et al. [6] developed ScatterBlogs2 to extract microblog posts of interest. It allows analysts to build customized post filters and classifiers interactively. These filters and classifiers are then utilized to support real-time post monitoring. In post filtering, the post dimension is considered the primary dimension and the hashtag the secondary dimension. In contrast, we tightly integrate the posts, users, and hashtags in the MRG model and use the model to retrieve high-quality microblog data. Moreover, we also model uncertainty in the retrieval process. Since analysts can interactively refine the model, we can further improve retrieval quality by leveraging the uncertainty formalization and analysts' knowledge.
2.2 Interactive Uncertainty Analytics
Frequently, uncertainty is introduced into visual analytics when data is acquired, transformed, or visualized [14, 24, 27]. A number of uncertainty analysis methods have been proposed, which can be categorized into two groups: uncertainty visualization and uncertainty modeling.

Many studies on uncertainty visualization have been conducted in the field of geographic visualization and scientific visualization [32, 37, 42]. Typical uncertainty representation techniques include the addition of glyphs and geometry, the modification of geometry and attributes, animation, sonification, and psycho-visual approaches [32]. Recently, researchers have become increasingly interested in the design of uncertainty representations for information visualization and visual analytics. For example, Collins et al. [13] designed two alternatives, the gradient border and the bubble border, to illustrate uncertainty in lattice graphs. Wu et al. [48] developed a circular wheel representation and subjective logic to convey uncertainty in customer review analysis. Slingsby et al. [38] utilized bar charts to reveal the uncertainty associated with geodemographic classifiers. To represent uncertainty in aggregated vertex sets, Vehlow et al. [43] considered the lightness and shape of the node. Chen et al. [10] [...] Compared with these methods, MutualRanker not only visualizes uncertainty, but also its propagation on a graph. We also allow users to interactively modify the uncertain results.
Another type of uncertainty visualization represents the uncertainty in the analysis process. Zuk and Carpendale [55] studied issues related to uncertainty in reasoning and determined the type of visual support required. Correa et al. [14] developed a framework to represent and quantify the uncertainty in the visual analytics process. Wu et al. [49] extended this framework to show the uncertainty flow in the analysis process. By contrast, our work aims to model uncertainty in microblog retrieval. We focus on visually illustrating topological uncertainty propagation on a graph and on designing an iterative visual analytics process to actively engage analysts in reducing overall uncertainty.

Probability theory, fuzzy set theory, rough set theory, and evidence theory are four major approaches to model uncertainty [54]. Among these approaches, probability theory is the most commonly used method in visual analytics. For example, Correa et al. [14] and Wu et al. [49] regarded uncertainty as a parameter that describes the dispersion of measured values. Specifically, they represented uncertainty as an estimated standard deviation, in which the measured value is defined on the set of both positive and negative real numbers. Since the measured value (the ranking score) in our approach is defined on the set of positive real