Data and Methodology
Open Data
Stage 1: Finding the data sources
- Monthly Tourism Statistics (I-92) from the U.S. Department of Commerce, National Travel & Tourism Office
- Country Description & Safety Information from U.S. Government open data
Stage 2: Data cleaning & exploration
Stage 3: Analyze the data and create visualizations with Carto, CartoVL, Tableau, Infogram, and Excel
Big Data
Stage 1: Finding the data sources
- Twitter posts that have both #travel AND #terrorism
Stage 2: Parsing HTML and JSON files
Stage 3: Analyze tweets with sentiment analysis and node visualization using Jupyter Notebook and Kumu.io
Data Crossover
The Monthly Tourism Statistics and Global Terrorism Database files define their regions differently, so the overlapping variables need to be identified before the two can be aggregated and compared properly.
Regions defined in Monthly Tourism Statistics:
- South America
- Central America
- Caribbean
- Europe
- Middle East
- Africa
- Asia
- Oceania
Regions defined in Global Terrorism Data:
- North America
- South America
- Central America & Caribbean
- Western Europe
- Eastern Europe
- Middle East & North Africa
- Sub-Saharan Africa
- Central Asia
- South Asia
- East Asia
- Southeast Asia
- Australasia & Oceania
Because the two sources define regions differently, and given the time limits of this assignment, I aggregated the data using this combined list:
1. North America
Canada, Mexico, United States
​
2. Central America & Caribbean
Antigua and Barbuda, Bahamas, Barbados, Belize, Cayman Islands, Costa Rica, Cuba, Dominica, Dominican Republic, El Salvador, Grenada, Guadeloupe, Guatemala, Haiti, Honduras, Jamaica, Martinique, Nicaragua, Panama, St. Kitts and Nevis, St. Lucia, Trinidad and Tobago
​
3. South America
Argentina, Bolivia, Brazil, Chile, Colombia, Ecuador, Falkland Islands, French Guiana, Guyana, Paraguay, Peru, Suriname, Uruguay, Venezuela
​
4. Asia
East Asia: China, Hong Kong, Japan, Macau, North Korea, South Korea, Taiwan
​
Southeast Asia: Brunei, Cambodia, East Timor, Indonesia, Laos, Malaysia, Myanmar, Philippines, Singapore, South Vietnam, Thailand, Vietnam
​
South Asia: Afghanistan, Bangladesh, Bhutan, India, Maldives, Mauritius, Nepal, Pakistan, Sri Lanka
​
Central Asia: Armenia, Azerbaijan, Georgia, Kazakhstan, Kyrgyzstan, Tajikistan, Turkmenistan, Uzbekistan
​
5. Europe
Western Europe: Andorra, Austria, Belgium, Cyprus, Denmark, Finland, France, Germany, Gibraltar, Greece, Iceland, Ireland, Italy, Luxembourg, Malta, Netherlands, Norway, Portugal, Spain, Sweden, Switzerland, United Kingdom, Vatican City, West Germany (FRG)
​
Eastern Europe: Albania, Belarus, Bosnia-Herzegovina, Bulgaria, Croatia, Czech Republic, Czechoslovakia, East Germany (GDR), Estonia, Hungary, Kosovo, Latvia, Lithuania, Macedonia, Moldova, Montenegro, Poland, Romania, Russia, Serbia, Serbia-Montenegro, Slovak Republic, Slovenia, Soviet Union, Ukraine, Yugoslavia
​
6. Middle East & Africa
Middle East & North Africa: Algeria, Bahrain, Egypt, Iran, Iraq, Israel, Jordan, Kuwait, Lebanon, Libya, Morocco, North Yemen, Qatar, Saudi Arabia, South Yemen, Syria, Tunisia, Turkey, United Arab Emirates, West Bank and Gaza Strip, Western Sahara, Yemen
​
Sub-Saharan Africa: Angola, Benin, Botswana, Burkina Faso, Burundi, Cameroon, Central African Republic, Chad, Comoros, Democratic Republic of the Congo, Djibouti, Equatorial Guinea, Eritrea, Ethiopia, Gabon, Gambia, Ghana, Guinea, Guinea-Bissau, Ivory Coast, Kenya, Lesotho, Liberia, Madagascar, Malawi, Mali, Mauritania, Mozambique, Namibia, Niger, Nigeria, People's Republic of the Congo, Republic of the Congo, Rhodesia, Rwanda, Senegal, Seychelles, Sierra Leone, Somalia, South Africa, South Sudan, Sudan, Swaziland, Tanzania, Togo, Uganda, Zaire, Zambia, Zimbabwe
​
7. Australasia & Oceania
Australia, Fiji, French Polynesia, New Caledonia, New Hebrides, New Zealand, Papua New Guinea, Solomon Islands, Vanuatu, Wallis and Futuna
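The combined list above can be expressed as a lookup table. The following is a minimal sketch, assuming the Global Terrorism Database region labels are spelled as in the list; the names `GTD_TO_COMBINED` and `count_attacks` and the sample rows are hypothetical and only illustrate the aggregation step, not the exact procedure used in Carto.

```python
from collections import Counter

# Hypothetical lookup: Global Terrorism Database region label -> combined region.
# The label spellings are assumed to match the combined list above.
GTD_TO_COMBINED = {
    "North America": "North America",
    "Central America & Caribbean": "Central America & Caribbean",
    "South America": "South America",
    "East Asia": "Asia",
    "Southeast Asia": "Asia",
    "South Asia": "Asia",
    "Central Asia": "Asia",
    "Western Europe": "Europe",
    "Eastern Europe": "Europe",
    "Middle East & North Africa": "Middle East & Africa",
    "Sub-Saharan Africa": "Middle East & Africa",
    "Australasia & Oceania": "Australasia & Oceania",
}


def count_attacks(events):
    """Count events per (combined region, year).

    `events` is an iterable of (gtd_region, year) pairs.
    """
    counts = Counter()
    for gtd_region, year in events:
        counts[(GTD_TO_COMBINED[gtd_region], year)] += 1
    return counts


# Made-up rows for illustration only; these are not real records.
sample_events = [
    ("South Asia", 2016),
    ("East Asia", 2016),
    ("Western Europe", 2017),
]
attack_counts = count_attacks(sample_events)
```

With a table like this, the four Asian sub-regions and the two European sub-regions collapse into single totals that can be compared with the tourism regions.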
​
​
Methodology
Map 1
I uploaded the ESRI country shapefile to Carto and manually added the U.S. Department of State travel advisory level for each country. The advisory levels change whenever an event of international concern, such as a terrorist attack, takes place, but the dataset on Carto is not updated in real time; it reflects the levels after the Sri Lanka attack on Easter Day 2019. I downloaded the terrorist attack records from the Global Terrorism Database, cleaned them to include only events from 1999 to 2017, and uploaded them to Carto. I then used the "Calculate Clusters of Points" analysis to overlay the aggregated attacks from 1999 to 2017 on the color-coded country layer.
Obstacle: Carto kept returning the error "Analysis B1 failed: Analysis cache space limits exceeded" when I tried to use the "Calculate Clusters of Points" and "Create Centroids" analyses.
Line Chart Series 1, 2 & 3
For line chart 1: Using the uploaded Carto dataset of all attacks from 1999 to 2017, I calculated the total number of events per region per year in Carto and recorded the results in Excel. I then uploaded the results to Tableau and made a line chart for each region.
For line chart 2: I calculated the year-over-year percentage change in flights for each region and again used Tableau to create a line chart for each region.
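The percentage change used for line chart 2 is (current year - previous year) / previous year × 100. A minimal sketch of that arithmetic follows; the function name `percent_change` and the flight counts are hypothetical and stand in for the real regional totals.

```python
def percent_change(series):
    """Year-over-year percentage change.

    `series` is a list of (year, value) pairs sorted by year.
    Returns {year: percent change relative to the previous year}.
    """
    changes = {}
    for (_, previous), (year, current) in zip(series, series[1:]):
        changes[year] = (current - previous) * 100 / previous
    return changes


# Made-up flight counts for one region, for illustration only.
flights_by_year = [(2015, 200), (2016, 250), (2017, 225)]
changes = percent_change(flights_by_year)
```

The first year has no entry because there is no earlier year to compare it with.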
For line chart 3: I calculated each region's share of the total flights to all regions of the world and found that the most visited regions are Europe, Mexico, Canada and Asia. I decided to include only these regions on the line chart for clarity and to mention the lower percentages in the text.
Obstacle: I double-checked the percentage changes and the total attacks per region several times, both before and after making the line charts, to make sure the numbers were correct. There were also some issues with the Wix templates when creating the gallery of embedded HTML.
Linear Regression Graphs
I gathered, cleaned and aggregated the data for each region. I used the CORR() and RSQ() functions to find R and R-squared. For the Middle East & Africa region, I had to total the flights for its sub-regions in order to compare them with the terrorist attack data. The same was done for Central America & Caribbean.
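For a simple linear regression with one predictor, R-squared is the square of the Pearson correlation coefficient R. The following is a minimal sketch of that calculation using only the Python standard library; it is not the spreadsheet workflow described above, and the function name `pearson_r` and both sample series are made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Made-up yearly values for one region, for illustration only.
attacks_per_year = [10, 20, 30, 40]
flights_per_year = [8, 6, 4, 2]

r = pearson_r(attacks_per_year, flights_per_year)
r_squared = r ** 2
```

The sample series are perfectly linear, so R sits at the negative extreme; real regional data would fall somewhere between -1 and 1.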
Obstacle: I had to make sure the X and Y axes had the correct data, and I double-checked the R and R-squared results. It took a very long time to arrive at the point where I could present the findings, regardless of whether they confirm my research hypothesis.
Map 2
Working from the Global Terrorism Database Excel file, I used CartoVL and Visual Studio to write the code for a visualization that lets me adjust the resolution of points per grid cell, which helps when dealing with too many data points. I wanted to display color-coded points for regions including Middle East & North Africa, South Asia, Western Europe, Sub-Saharan Africa, Eastern Europe, and South and North America. These regional points are also shown in a time series starting in 1999 and ending in 2017.
Obstacle: I could not remove the decimals shown after the year while the time series is updating. I needed more time to figure out this issue, but I had to move on due to project time constraints.
Word Cloud
I downloaded the Travel Warning JSON from open data.gov and imported it into Jupyter Notebook. I was interested in the column for the geopolitical area or country name, the column for the destination description, and the column for safety and security. There were 211 countries in total. I extended STOPWORDS to exclude words and tags that aren't essential, like "but", "href" and "There". After filtering out the stopwords, I plotted the remaining keywords on the figure and then expanded the size of the figure to include as many keywords as possible.
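The stopword filtering step can be sketched as follows. This is a minimal illustration using only the Python standard library rather than the notebook code itself; `BASE_STOPWORDS`, `EXTRA_STOPWORDS`, `keyword_counts` and the sample sentence are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical base stopwords; a real word cloud library ships a much longer set.
BASE_STOPWORDS = {"the", "a", "and", "is", "in", "of", "to"}
# Extra words and leftover tag fragments to exclude, as described above.
EXTRA_STOPWORDS = {"but", "href", "there"}
stopwords = BASE_STOPWORDS | EXTRA_STOPWORDS


def keyword_counts(text, stopwords):
    """Lowercase the text, split it into words, and count the non-stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(word for word in words if word not in stopwords)


# Made-up sentence for illustration only.
sample_text = "There is crime in the capital, but crime is rare. href"
counts = keyword_counts(sample_text, stopwords)
```

Lowercasing before the comparison is what lets "There" match the lowercase entry in the stopword set. The resulting counts are the kind of word frequencies a word cloud is drawn from.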
Obstacle: For a long time I could not add to the STOPWORDS lexicon, and no online sources helped at the time, but I was eventually able to add the words manually by creating the list directly.
Twitter Sentiment Analysis
For the 136 tweets posted between June 2009 and April 2018, I scraped the HTML and parsed it with the BeautifulSoup library in Python. I also used regex for additional filtering so that only the text and hashtags are displayed.
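The regex filtering can be sketched as below. This is a minimal illustration that uses only the standard library `re` and `html` modules in place of BeautifulSoup; the function name `clean_tweet`, the `tweet-text` class name and the sample fragment are hypothetical and do not reproduce the real markup of the downloaded page.

```python
import re
from html import unescape


def clean_tweet(html_fragment):
    """Reduce one tweet's HTML to plain text and a list of its hashtags."""
    text = re.sub(r"<[^>]+>", " ", html_fragment)   # drop HTML tags
    text = unescape(text)                           # decode entities such as &amp;
    text = re.sub(r"https?://\S+", " ", text)       # drop links
    text = re.sub(r"\s+", " ", text).strip()        # collapse whitespace
    hashtags = re.findall(r"#\w+", text)
    return text, hashtags


# Made-up fragment for illustration only.
sample_html = (
    '<p class="tweet-text">Is it safe? '
    '<a href="/hashtag/travel">#travel</a> '
    '<a href="/hashtag/terrorism">#terrorism</a> '
    "https://example.com/x</p>"
)
text, hashtags = clean_tweet(sample_html)
```

Stripping tags with a regex is fragile on deeply nested markup, which is one reason to let an HTML parser isolate each tweet first and keep regex for the final text cleanup.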
After the tweets were filtered, I ran them through VADER sentiment analysis, using its lexicon together with the nltk stopwords, and printed the results based on the compound score for each tweet.
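VADER's compound score ranges from -1 (most negative) to 1 (most positive). The cutoffs used for this project are not stated here, so the sketch below assumes the thresholds commonly suggested in the VADER documentation (at or above 0.05 is positive, at or below -0.05 is negative). The function name `label_compound` and the sample scores are hypothetical.

```python
def label_compound(score, threshold=0.05):
    """Turn a VADER compound score into a sentiment label.

    The default threshold of 0.05 is an assumption based on the commonly
    suggested VADER cutoffs, not a value taken from this project.
    """
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


# Made-up compound scores for illustration only.
sample_scores = {"tweet_1": 0.62, "tweet_2": -0.48, "tweet_3": 0.01}
labels = {tweet_id: label_compound(score) for tweet_id, score in sample_scores.items()}
```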
Obstacle: The standard Twitter API only yields 7 days' worth of tweets, so given the limited time I had for this assignment, I decided to search for #travel and #terrorism on Twitter itself instead. The results page is the HTML page I downloaded and scraped. The HTML file was very messy: each tweet is nested in a tag with many other elements, such as links and hashtags.
Disclaimer: The peer advising team of Data Science recommended the use of regex.