Data and Methodology

Open Data

Stage 1: Finding the data sources

Monthly Tourism Statistics (I-92) from the U.S. Department of Commerce - National Travel & Tourism Office
Global Terrorism Database (GTD)
Country Description & Safety Information from U.S. Government open data

Stage 2: Data Cleaning & Exploration

Stage 3: Analyze data and create visualizations with Carto, CartoVL, Tableau, Infogram, and Excel

Big Data

Stage 1: Finding the data sources

Twitter posts that have both #travel AND #terrorism

Stage 2: Parsing HTML and JSON files

Stage 3: Analyze tweets with Sentiment Analysis & Node Visualization using Jupyter Notebook and Kumu.io

The Monthly Tourism Statistics and Global Terrorism Database files do not share the same variables, so the variables that cross over between them need to be identified before the data can be aggregated and compared properly.

Data Cross Over

Regions defined in Monthly Tourism Statistics:

South America
Central America
Caribbean
Europe
Middle East
Africa
Asia
Oceania

Regions defined in the Global Terrorism Database:

North America
South America
Central America & Caribbean
Western Europe
Eastern Europe
Middle East & North Africa
Sub-Saharan Africa
Central Asia
South Asia
East Asia
Southeast Asia
Australasia & Oceania

Because the two sources define their regions differently, and given the time limitation of this assignment, I aggregated the data using this combined list:

1. North America
Canada, Mexico, United States

2. Central America & Caribbean
Antigua and Barbuda, Bahamas, Barbados, Belize, Cayman Islands, Costa Rica, Cuba, Dominica, Dominican Republic, El Salvador, Grenada, Guadeloupe, Guatemala, Haiti, Honduras, Jamaica, Martinique, Nicaragua, Panama, St. Kitts and Nevis, St. Lucia, Trinidad and Tobago

3. South America
Argentina, Bolivia, Brazil, Chile, Colombia, Ecuador, Falkland Islands, French Guiana, Guyana, Paraguay, Peru, Suriname, Uruguay, Venezuela

4. Asia

East Asia: China, Hong Kong, Japan, Macau, North Korea, South Korea, Taiwan

Southeast Asia: Brunei, Cambodia, East Timor, Indonesia, Laos, Malaysia, Myanmar, Philippines, Singapore, South Vietnam, Thailand, Vietnam

South Asia: Afghanistan, Bangladesh, Bhutan, India, Maldives, Mauritius, Nepal, Pakistan, Sri Lanka

Central Asia: Armenia, Azerbaijan, Georgia, Kazakhstan, Kyrgyzstan, Tajikistan, Turkmenistan, Uzbekistan

5. Europe

Western Europe: Andorra, Austria, Belgium, Cyprus, Denmark, Finland, France, Germany, Gibraltar, Greece, Iceland, Ireland, Italy, Luxembourg, Malta, Netherlands, Norway, Portugal, Spain, Sweden, Switzerland, United Kingdom, Vatican City, West Germany (FRG)

Eastern Europe: Albania, Belarus, Bosnia-Herzegovina, Bulgaria, Croatia, Czech Republic, Czechoslovakia, East Germany (GDR), Estonia, Hungary, Kosovo, Latvia, Lithuania, Macedonia, Moldova, Montenegro, Poland, Romania, Russia, Serbia, Serbia-Montenegro, Slovak Republic, Slovenia, Soviet Union, Ukraine, Yugoslavia

6. Middle East & Africa

Middle East & North Africa: Algeria, Bahrain, Egypt, Iran, Iraq, Israel, Jordan, Kuwait, Lebanon, Libya, Morocco, North Yemen, Qatar, Saudi Arabia, South Yemen, Syria, Tunisia, Turkey, United Arab Emirates, West Bank and Gaza Strip, Western Sahara, Yemen

Sub-Saharan Africa: Angola, Benin, Botswana, Burkina Faso, Burundi, Cameroon, Central African Republic, Chad, Comoros, Democratic Republic of the Congo, Djibouti, Equatorial Guinea, Eritrea, Ethiopia, Gabon, Gambia, Ghana, Guinea, Guinea-Bissau, Ivory Coast, Kenya, Lesotho, Liberia, Madagascar, Malawi, Mali, Mauritania, Mozambique, Namibia, Niger, Nigeria, People's Republic of the Congo, Republic of the Congo, Rhodesia, Rwanda, Senegal, Seychelles, Sierra Leone, Somalia, South Africa, South Sudan, Sudan, Swaziland, Tanzania, Togo, Uganda, Zaire, Zambia, Zimbabwe

7. Australasia & Oceania

Australia, Fiji, French Polynesia, New Caledonia, New Hebrides, New Zealand, Papua New Guinea, Solomon Islands, Vanuatu, Wallis and Futuna
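
The combined list amounts to a lookup from each source region to one of seven aggregate regions. Below is a minimal Python sketch of the Global Terrorism Database side of that lookup. The names `GTD_TO_COMBINED` and `combined_region` are hypothetical, and the exact spelling of the region labels is an assumption; this was not part of the original workflow, which was done in Carto and Excel.

```python
# Hypothetical lookup from GTD region labels to the seven combined regions.
# The label spellings are assumptions and may need adjusting to the file.
GTD_TO_COMBINED = {
    "North America": "North America",
    "Central America & Caribbean": "Central America & Caribbean",
    "South America": "South America",
    "East Asia": "Asia",
    "Southeast Asia": "Asia",
    "South Asia": "Asia",
    "Central Asia": "Asia",
    "Western Europe": "Europe",
    "Eastern Europe": "Europe",
    "Middle East & North Africa": "Middle East & Africa",
    "Sub-Saharan Africa": "Middle East & Africa",
    "Australasia & Oceania": "Australasia & Oceania",
}


def combined_region(gtd_region: str) -> str:
    """Return the combined region for a GTD region label."""
    key = gtd_region.strip()
    if key not in GTD_TO_COMBINED:
        # Fail loudly so an unmapped label is never silently dropped.
        raise ValueError(f"Unmapped GTD region: {gtd_region!r}")
    return GTD_TO_COMBINED[key]
```

Raising an error on an unknown label, rather than returning a default, makes spelling differences between the two sources visible early.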

Methodology

Map 1

I uploaded the ESRI country shapefile to Carto and manually added the U.S. Department of State travel advisory level for each country. The advisory levels are updated whenever an event of international concern, such as a terrorist attack, takes place, but the dataset on Carto is not updated in real time; it was created after the Easter 2019 attacks in Sri Lanka. I downloaded the terrorist attacks from the Global Terrorism Database, cleaned the data to include only events from 1999 to 2017, and uploaded it to Carto. I then used the "Calculate Clusters of Points" analysis to overlay the aggregated attacks from 1999 to 2017 on the color-coded layer of countries.

Obstacle: Carto kept returning the error "Analysis B1 failed: Analysis cache space limits exceeded" when I tried to use the "Calculate Clusters of Points" and "Create Centroids" analyses.

Line chart series 1, 2 & 3

For line chart 1, I used Carto on the uploaded dataset of all attacks from 1999 to 2017 to calculate the total number of events per region per year, and recorded the results in Excel. I then uploaded the results to Tableau and made a line chart for each region.
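
The per-region, per-year count can also be reproduced outside Carto. The sketch below is an illustration under assumptions, not the method used here: it assumes each event is a row with a year column named `iyear` and a region column named `region_txt`, and the sample rows are made up.

```python
from collections import Counter


def count_events(rows, first_year=1999, last_year=2017):
    """Count events per (region, year), keeping only the chosen year range."""
    counts = Counter()
    for row in rows:
        year = int(row["iyear"])          # assumed year column
        if first_year <= year <= last_year:
            counts[(row["region_txt"], year)] += 1  # assumed region column
    return counts


# Made-up rows for illustration only; real rows would come from csv.DictReader.
sample = [
    {"iyear": "1998", "region_txt": "South Asia"},
    {"iyear": "1999", "region_txt": "South Asia"},
    {"iyear": "1999", "region_txt": "South Asia"},
    {"iyear": "2017", "region_txt": "Western Europe"},
]
counts = count_events(sample)
print(counts[("South Asia", 1999)])  # 2
```

Each (region, year) pair then becomes one point on that region's line.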

For line chart 2, I calculated the year-over-year percentage change in flights for each region and again used Tableau to create a line chart for each region.
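
Year-over-year percentage change is (current year - previous year) / previous year x 100. A minimal Python sketch, assuming one list of yearly flight totals per region; the function name `pct_change` and the numbers are made up for illustration:

```python
def pct_change(values):
    """Year-over-year percentage change for a list of yearly totals.

    The first year has no predecessor, so its entry is None. A previous
    value of zero also yields None because the change is undefined.
    """
    if not values:
        return []
    changes = [None]
    for prev, curr in zip(values, values[1:]):
        changes.append((curr - prev) / prev * 100 if prev else None)
    return changes


# Made-up yearly totals for one region.
flights = [200, 250, 225]
changes = pct_change(flights)  # first entry None, then +25% and -10%
```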

For line chart 3, I calculated each region's percentage of the total flights to all regions of the world and found that the most visited regions are Europe, Mexico, Canada and Asia. I decided to include only these regions on the line chart for clarity and to mention the lower percentages in the text.
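
A region's share is its flights divided by the total across all regions, times 100. A short Python sketch of that calculation for a single year; the function name `region_shares` and the totals are made up for illustration:

```python
def region_shares(totals):
    """Return each region's percentage of the grand total."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("Total flights is zero; shares are undefined.")
    return {region: count * 100 / grand_total for region, count in totals.items()}


# Made-up totals for one year.
year_totals = {"Europe": 50, "Mexico": 30, "Asia": 20}
shares = region_shares(year_totals)
```

The shares for any one year always sum to 100, which is a quick check on the arithmetic.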

Obstacle: I double-checked my percentage-change figures and the totals of attacks per region both before and after making the line charts. There were also some issues with the Wix templates and with creating the gallery of embedded HTML.

Linear Regression Graphs

I gathered, cleaned and aggregated the data for each region. I used the CORR() and RSQ() functions to find R and R-squared. For combined regions such as Middle East & Africa, I had to total the flights of the component regions in order to compare them to the data for terrorist attacks. The same was done for Central America & Caribbean.
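
R here is Pearson's correlation coefficient, and R-squared is its square. The sketch below computes both in plain Python as a way to cross-check spreadsheet results. It is an illustration, not the spreadsheet functions themselves, and the names `pearson_r` and `r_squared` are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("Need two sequences of equal length >= 2.")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("Correlation is undefined for a constant series.")
    return sxy / sqrt(sxx * syy)


def r_squared(xs, ys):
    """Square of Pearson's r: share of variance explained by a linear fit."""
    return pearson_r(xs, ys) ** 2
```

For example, `xs` could hold the yearly attack counts for a region and `ys` the yearly flights to it; swapping the two leaves both values unchanged.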

Obstacle: I had to make sure the X and Y axes had the correct data, and I double-checked the R and R-squared results. It took a very long time to reach the point where I could present the findings, regardless of whether they confirmed my research hypothesis.

Map 2

Starting from the Global Terrorism Database Excel file, I used CartoVL and Visual Studio to write the code for a visualization that lets me adjust the resolution of points per grid cell, which helps in dealing with too many data points. I wanted to display color-coded points for regions including Middle East & North Africa, South Asia, Western Europe, Sub-Saharan Africa, Eastern Europe, South America and North America. These regional points are also shown in a time series starting in 1999 and ending in 2017.
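
The general idea of aggregating points per grid cell can be sketched in Python. This is not CartoVL code and makes no claim about how CartoVL works internally; it only illustrates how a coarser cell size reduces the number of symbols drawn. The function name `bin_points` and the sample coordinates are made up.

```python
from collections import Counter
from math import floor


def bin_points(points, cell_deg=5.0):
    """Count (longitude, latitude) points per square grid cell.

    cell_deg is the cell size in degrees; larger cells mean fewer,
    denser aggregates.
    """
    counts = Counter()
    for lon, lat in points:
        cell = (floor(lon / cell_deg), floor(lat / cell_deg))
        counts[cell] += 1
    return counts


# Made-up coordinates for illustration.
points = [(1.0, 1.0), (2.0, 3.0), (7.0, 1.0), (-1.0, -1.0)]
cells = bin_points(points, cell_deg=5.0)
```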

Obstacle: I could not remove the decimals shown after the year while the time series is updating. I needed more time to figure out this issue, but I had to move on due to project time constraints.

Word Cloud

I downloaded the Travel Warning JSON from open data.gov and imported it into Jupyter Notebook. I was interested in the columns for geopolitical area (country name), destination description, and safety and security. There were 211 countries in total. I updated the STOPWORDS dictionary to exclude words and tags that aren't essential, like "but", "href" and "There". After filtering out the stopwords, I plotted the remaining keywords on the figure, then expanded the size of the figure to include as many keywords as possible.
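
The stopword filtering step can be illustrated with the standard library alone. The sketch below is a stand-in for the word cloud tooling, not the code used here: `BASE_STOPWORDS` is a tiny made-up list, `EXTRA_STOPWORDS` holds the words named above, and the sample sentence is invented.

```python
import re
from collections import Counter

# Tiny stand-in for a real stopword list (assumption, for illustration).
BASE_STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}
# Extra words and tag fragments named in the text, lowercased.
EXTRA_STOPWORDS = {"but", "href", "there"}


def keyword_counts(text, stopwords):
    """Strip HTML tags, lowercase, and count words that are not stopwords."""
    text = re.sub(r"<[^>]+>", " ", text)        # drop tags such as <a href=...>
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(word for word in words if word not in stopwords)


sample = 'There is a <a href="x">risk</a> of crime, but crime is rare.'
counts = keyword_counts(sample, BASE_STOPWORDS | EXTRA_STOPWORDS)
print(counts.most_common(1))  # [('crime', 2)]
```

If the figure was built with the `wordcloud` package (an assumption), the same idea applies: its `STOPWORDS` is a set, so it can be copied, extended with extra words, and passed to `WordCloud` through the `stopwords` argument.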

Obstacle: For the longest time I could not add to the STOPWORDS lexicon, and no online sources helped at the time, but I was able to add the words manually by creating the list directly.

Twitter Sentiment Analysis

For the 136 tweets posted between June 2009 and April 2018, I scraped the HTML and parsed it with the BeautifulSoup library in Python. I also used regex for additional filtering to keep only the text and hashtags.
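
The regex filtering can be sketched as follows. This is a standard-library stand-in, not the BeautifulSoup code used here: it assumes the HTML fragment for one tweet has already been isolated, and the sample markup is made up rather than copied from Twitter's page structure.

```python
import re
from html import unescape

TAG_RE = re.compile(r"<[^>]+>")        # any HTML tag
URL_RE = re.compile(r"https?://\S+")   # bare links left in the text
HASHTAG_RE = re.compile(r"#\w+")       # hashtags such as #travel


def clean_tweet(fragment):
    """Return (plain text, list of hashtags) for one tweet's HTML fragment."""
    text = unescape(TAG_RE.sub(" ", fragment))  # strip tags, decode &amp; etc.
    text = URL_RE.sub(" ", text)                # drop links
    text = " ".join(text.split())               # collapse whitespace
    return text, HASHTAG_RE.findall(text)


# Made-up fragment for illustration.
fragment = (
    '<p>Stay safe &amp; informed <a href="/hashtag/travel">#travel</a> '
    '<a href="/hashtag/terrorism">#terrorism</a> https://example.com/x</p>'
)
text, hashtags = clean_tweet(fragment)
print(text)      # Stay safe & informed #travel #terrorism
print(hashtags)  # ['#travel', '#terrorism']
```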

After the tweets were filtered, I ran VADER sentiment analysis on them, using its lexicon together with the NLTK stopwords, and printed results based on the compound score of each tweet.
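
VADER's compound score ranges from -1 (most negative) to +1 (most positive). One way to turn it into a label is sketched below. The cutoff of 0.05 is the commonly cited convention and is an assumption here, since the thresholds actually used are not recorded; the function name `label_compound` is hypothetical.

```python
def label_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a sentiment label."""
    if not -1.0 <= compound <= 1.0:
        raise ValueError("Compound score must be between -1 and 1.")
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"
```

With NLTK, the score itself would come from `SentimentIntensityAnalyzer().polarity_scores(text)["compound"]` in `nltk.sentiment.vader`, after downloading the `vader_lexicon` resource.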

Obstacle: The standard Twitter API only yields 7 days' worth of tweets, so given the limited time I had for this assignment, I decided to search for #travel and #terrorism on Twitter itself instead. The results page is the HTML page I downloaded and scraped. The HTML file was very messy: each tweet is nested in a tag with many other elements, such as links and hashtags.

Disclaimer: The peer advising team of Data Science recommended the use of regex.
