Analysis of Wikileaks Cables Using NLP Techniques
CS671: Natural Language Processing
Arpit Jain, Sugam Anand
Mentor: Dr. Amitabha Mukerjee
Why Wikileaks?
The Wikileaks embassy cables release is a large dataset of 251,287 official documents from more than 250 US embassies and consulates worldwide. The cables show the extent of US spying on its allies and the UN; turning a blind eye to corruption and human rights abuse in "client states"; backroom deals with supposedly neutral countries; lobbying for US corporations; and the measures US diplomats take to advance those who have access to them. Such a large, rich, and structured dataset lends itself to analysis with natural language processing and information retrieval techniques.
Distribution of cables http://wikileaks.org/cablegate.html
Structure of Cables
A cable contains:
Source: the embassy that sent the cable
Destination: the target embassies
Date: the sending date
Body: the raw text
Tags: meta information about the cable, such as its classification (classified, unclassified, secret, etc.)
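The cable structure above can be sketched as a small record type. This is only an illustration: the class name `Cable`, its field names, the `has_tag` helper, and the sample values are all hypothetical and are not taken from the Wikileaks data format or from the project's code.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class Cable:
    """Hypothetical in-memory representation of one cable."""
    source: str                  # embassy that sent the cable
    destinations: List[str]      # target embassies
    sent: date                   # sending date
    body: str                    # raw text of the cable
    tags: List[str] = field(default_factory=list)  # meta information

    def has_tag(self, tag: str) -> bool:
        """Case-insensitive check for a tag such as 'secret'."""
        return tag.lower() in (t.lower() for t in self.tags)


# Made-up example values, for illustration only.
cable = Cable(
    source="Islamabad",
    destinations=["Washington"],
    sent=date(2005, 10, 12),
    body="Example body text.",
    tags=["UNCLASSIFIED"],
)
print(cable.has_tag("unclassified"))
```

Keeping the tags as a plain list of strings makes it easy to filter cables by classification before any NLP step is run.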
Objective
Diplomats communicate about topics that reference people, places, and organizations.
Extract these entities from the Wikileaks cables.
Infer the topic under discussion.
Determine the opinion of the diplomats (and, by extension, of America) towards the topic.
Map these over a timeline.
Methodology
Get cables for multiple time periods for the given embassies.
Extract the entities using the NLTK named entity recognizer or the Stanford CoreNLP toolkit.
Score these entities by their occurrence frequency across the cables in a particular time frame.
Infer the topics using a topic modelling approach such as LDA, PLSA, or LSI.
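The scoring step can be sketched as follows. This is a minimal sketch under stated assumptions, not the project's implementation: it assumes the entities have already been extracted for each cable (the NER step is not shown), it groups cables by year, and it reads "occurrence frequency" as the number of cables in a period that mention an entity. The function name `score_entities` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def score_entities(cables, period_of=lambda sent: sent[:4]):
    """Count, per time period, how many cables mention each entity.

    cables: iterable of (date string 'YYYY-MM-DD', list of entity strings).
    period_of: maps a date string to a period key (assumption: the year).
    """
    counts = defaultdict(Counter)
    for sent, entities in cables:
        # Count an entity once per cable so that a single long cable
        # repeating one name does not dominate the score.
        counts[period_of(sent)].update(set(entities))
    return counts


# Made-up input, for illustration only.
sample = [
    ("2005-10-12", ["Musharraf", "Kashmir", "Musharraf"]),
    ("2005-11-02", ["Kashmir"]),
    ("2006-01-15", ["Balochistan", "Musharraf"]),
]
scores = score_entities(sample)
print(scores["2005"].most_common(2))
```

Passing a different `period_of` (for example `lambda sent: sent[:7]` for months) changes the time frame without touching the counting logic.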
Progress
For Iran RPO Dubai: 3853 entities in total, such as 'IRIG', 'supreme leader Khamenei', 'Khatami', 'Mousavi', 'Islamic Revolution', 'Middle East'.
For Islamabad: 'Kashmir', 'Balochistan', 'Musharraf', 'North West Frontier Province'.
For New Delhi: 'PM Manmohan Singh', 'BJP', 'NSSP', 'Tsunami Relief'.
LDA Results for Islamabad
Relief operation by UN
['0.211*"usaid/dart" + 0.178*"relief" + 0.115*"water" + 0.114*"earthquake" + 0.113*"shelter" + 0.112*"tents" + 0.103*"october" + 0.101*"u.n." + 0.097*"sanitation" + 0.095*"food"']
Existence of extremists in madrassa
["0.018*ssp + 0.016*( + 0.012*2005 + 0.010*groups + 0.010*domestic + 0.010*leaders + 0.010*extremist + 0.010*madrassa + 0.009*'s + 0.008*its", '0.000*rns. + 0.000*opened + 0.000*increase + 0.000*2005. + 0.000*receiving + 0.000*viable + 0.000*shows + 0.000*rebuilding + 0.000*e. + 0.000*jalil']
LDA Results for New Delhi
Nuclear Deal
['0.115*"saran" + 0.113*"bjp" + 0.109*"nuclear" + 0.107*"congress" + 0.105*"jaishankar" + 0.103*"king" + 0.099*"pakistan" + 0.097*"nssp" + 0.094*"nepal" + 0.080*"iraq"']
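The topic strings shown in the LDA results follow a `weight*term + weight*term` layout, with terms sometimes quoted and sometimes not. A small helper can turn such a string into (term, weight) pairs for ranking or plotting. The helper name `parse_topic` is hypothetical, and the sketch assumes that terms never contain the separator `" + "`.

```python
def parse_topic(topic):
    """Split a 'w*term + w*term' string into (term, weight) pairs."""
    pairs = []
    for part in topic.split(" + "):
        # The weight is everything before the first '*'.
        weight, _, term = part.partition("*")
        # Drop surrounding whitespace and optional double quotes.
        term = term.strip().strip('"').strip()
        pairs.append((term, float(weight)))
    return pairs


# Shortened strings taken from the results above.
relief = parse_topic('0.211*"usaid/dart" + 0.178*"relief" + 0.115*"water"')
extremism = parse_topic("0.018*ssp + 0.016*( + 0.012*2005")

for term, weight in relief:
    print(f"{term}\t{weight}")
```

Parsing the unquoted variant also makes noise tokens such as `(` easy to spot, which suggests filtering punctuation and stop words before topic modelling.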
References
@InProceedings{oconnor-stewart-smith-13_extracting-intl-relations-from-political-context,
  author = {O'Connor, Brendan and Stewart, Brandon M. and Smith, Noah A.},
  title = {Learning to Extract International Relations from Political Context},
  booktitle = {Proc. 51st ACL (Long papers)},
  month = {August},
  year = {2013},
  pages = {1094--1104},
  url = {http://www.aclweb.org/anthology/P13-1108}
}