Projects

Analysis Presi

_Company_ Inclusivity & Equality Analysis.001.jpeg

_Company_ Inclusivity & Equality Analysis.002.jpeg

Data Clean SQL

Analysis Presentation

Data Cleaning SQL- BigQuery
Project 1

Information: This is a customer data set with customer names, IDs, addresses, cities, states, zip codes, and countries. This Data set was given to me through the Google Data Analytics Professional Certification Course.

Steps Taken:

1. Downloaded dataset into SQL BigQuery

2. I was asked to add a customer to the dataset and I added a row of data using the INSERT INTO and VALUES functions in this query:

INSERT INTO

`my-first-project-sql-363013.customer_data.customer_adress`

(customer_id, address, city, state, zipcode, country)

VALUES

(2645,'3456 Highfield Court', 'Jackson', 'MI',49202, "US")

3. I was asked to update a current customer address so I updated it using this query

UPDATE

`my-first-project-sql-363013.customer_data.customer_adress`

SET

address = '2346 New Haven Ct'

WHERE

customer_id = 4957

4. I wrote this next query to find out if there were any mistakes in the data and found some rows were showing a LENGTH of 3 and not 2 in the country row

SELECT

LENGTH(country) AS letters_in_country

FROM

`my-first-project-sql-363013.customer_data.customer_adress`

5. I wrote this query to find the mistake where LENGTH was 3 and not 2

SELECT

country

FROM

`my-first-project-sql-363013.customer_data.customer_adress`

WHERE

LENGTH(country) > 2

6. I found that someone imputed ‘USA’ Instead of ‘US’ in this query

SELECT

country

FROM

`my-first-project-sql-363013.customer_data.customer_adress`

WHERE

LENGTH(country) > 2

7. In order to view the customer_id with ‘US’ and ‘USA’ in the country column I used the SUBSTRING function and the DISTINCT function because there are duplicate customer_id in this query:

SELECT

DISTINCT customer_id

FROM

`customer_data.customer_adress`

WHERE

SUBSTRING(country, 1, 2) = "US"

8. I then write this query to find out if the same mistake has been made in the state column

SELECT

state

FROM

`my-first-project-sql-363013.customer_data.customer_adress`

WHERE

LENGTH(state) > 2

9. I find out that one OH has extra whitespace behind it and I use the trim function in the query to trim the whitespace

SELECT

DISTINCT customer_id

FROM

`customer_data.customer_adress`

WHERE

TRIM(state)='OH'

Information: This is a data set with information on the number of doors, fuel type, horsepower, etc.

Description of data: https://archive.ics.uci.edu/ml/datasets/Automobile

1. Download data onto BigQuery

2. Inspected data to make sure it matched the description using this query

SELECT

DISTINCT fuel_type

FROM

Cars.car_info

3. I found it matched the description so I moved on to see if the min and max length are the same as the description using this query

SELECT

MIN(length) AS min_length,

MAX(length) AS max_length

FROM

Cars.car_info

4. I then look for ‘null’ values in the num_of_doors

SELECT

*

FROM

Cars.car_info

WHERE

num_of_doors IS NULL

5. I find that there are 2 null values and checked with stakeholders and found that all the null values should be say “four” so I write this query to fill in the null

UPDATE

Cars.car_info

SET

num_of_doors = "four"

WHERE

make = "dodge"

AND fuel_type = "gas"

AND body_style = "sedan"

6. I do further inspection and use this query

SELECT

DISTINCT num_of_cylinders

FROM

Cars.car_info

7. I find out that instead of ‘two’ one of the cells says ‘tow’ so I write this query to change it

UPDATE

Cars.car_info

SET

num_of_cylinders = "two"

WHERE

num_of_cylinders = "tow"

8. I now inspect the compression ratio column the see if the range is between 7 and 23 like the data description says so I run this query

SELECT

MIN(compression_ratio) AS min_compression_ratio,

MAX(compression_ratio) AS max_compression_ratio

FROM

Cars.car_info

9. I find out that the max is 70 and this should be deleted (after I consult the stakeholder and manager) so I write this query to remove it.

DELETE

Cars.car_info

WHERE

compression_ratio = 70

10. I then check the drive_wheels column to check for consistency using this query:

SELECT

DISTINCT drive_wheels

FROM

Cars.car_info

11. I find that 4wd shows up twice as a distinct input so I write this query to find out why

SELECT

DISTINCT drive_wheels,

LENGTH(drive_wheels) AS string_length

FROM

Cars.car_info

12. I find out that multiple of the 4wd has 4 lengths so I write this query to fix that

UPDATE

Cars.car_info

SET

drive_wheels = TRIM(drive_wheels)

WHERE TRUE

Data Cleaning in SQL- BigQuery
Project 2

Analysis Presentation

Data Analysis SQL- BigQuery
Project 1 - Getting to know the data with sorting and filtering

Information: This is data from the Google Data Analytics coursework. It is a list of movies, release dates, revenue, etc.

1. Filter the data to only show the comedy Genre that made more than $500,000,000

SELECT

Movie_Title, Genre, Revenue `movie_data.movies`

FROM

`movie_data.movies`

WHERE

Genre = 'Comedy'

AND

Revenue>500000000

2. Sort the data by release date ( DECS) and filter the to only show the comedy genre.

SELECT

*

FROM

`movie_data.movies`

WHERE

Genre = 'Comedy'

ORDER BY

Release_Date

DESC

3. What was the average revenue in the comedy genre?

SELECT

AVG(Revenue) AS average_comedy_rev

FROM

`my-first-project-sql-363013.movie_data.movies`

WHERE

Genre = "Comedy"

4. What genre had the highest average revenue?

SELECT

Genre,

AVG(Revenue) AS average_rev

FROM

`my-first-project-sql-363013.movie_data.movies`

GROUP BY

Genre

ORDER BY

average_rev

DESC

Here is the result:

Analysis SQL

Project 2 - Using JOINs and subqueries

Information: This data is from a bike share app, from BigQuery, I will show the questions the Stakeholders asked and then what query I wrote to find the answer.

1. Compare the average number of bikes available at each station to the average number of bikes available at individual stations.

SELECT

station_id,

num_bikes_available,

(

SELECT

AVG(num_bikes_available)

FROM

`bigquery-public-data.new_york.citibike_stations`

)

AS avg_num_bikes_available

FROM

bigquery-public-data.new_york.citibike_stations

This is the result:

2. Find the start stations with the most number of rides.

SELECT

station_id,

name,

num_of_rides AS number_of_rides_at_station

FROM

(

SELECT

start_station_id,

COUNT(*) num_of_rides

FROM

`bigquery-public-data.new_york.citibike_trips`

GROUP BY

start_station_id

)

AS station_num_trips

INNER JOIN

`bigquery-public-data.new_york.citibike_stations` ON station_id = start_station_id

ORDER BY

station_num_trips.num_of_rides DESC

This is the result:

3. What were the 10 most popular trips (start location to end location) and what was the average time of those trips?

SELECT

CONCAT (start_station_name,"to", end_station_name) AS route,

COUNT (*) AS num_trips,

ROUND (AVG (CAST (tripduration AS int64) / 60) ,2) AS duration

FROM

`bigquery-public-data.new_york.citibike_trips`

GROUP BY

start_station_name,end_station_name

ORDER BY

num_trips DESC

LIMIT

10

This is the result:

Tableau Viz

Tableau Visualizations

Project 1 - Electric Cars in Washington State

Information: This data is from data.gov and it shows the Battery Electric Vehicles (BEVs) and Plugin Hybrid Electric Vehicles (PHEVs) that are currently registered through the Washington State Department of Licensing (DOL).

Key insights:

1. The Jaguar has the highest Electric Range out of any of the other electric cars with a top range of 212.1 miles

2. The clear winning of the most popular electric car in Washington State is Tesla (47,264), followed by Nissan (12,909), and then Chevrolet (9,846).

3. The highest population of electric cars in Washington State is in Seattle.

4. The most popular Tesla in Washinton State is the Model 3.

Click on the picture to open up Tableau and interact with the Visualization

Project 2 - World Happiness

Information: This data is from the Google Data Analytics coursework. The data looks at elements of happiness and quantifies them. I took the data and made visualizations that compared the elements to find out what really made people happy as well as looked at European happiness as well.

Key Insights:

1. The top 4 things that influence a country's happiness are; GDP, family, freedom, and life-expectancy

2. Europe, overall, was a happier nation and had higher correlations between family and happiness, life expectancy and happiness, and GDP and happiness than compared to the rest of the world.

3. The happiest country according to the data was a 4-way tie with; Norway (7.5), Denmark (7.5), Iceland (7.5), and Switzerland (7.5).

Click on the picture to open up Tableau and interact with the Visualization.

Project 3- Movies

Information: This data is from the Google Data Analytics coursework. The data includes revenue, budget, movie titles, etc.

Key Insights:

1. The 3 most popular Genres of movies made were #1 Action, #2 Comedy, and #3 Drama.

2. The most profitable Genres were #1 Adventure, #2 Family, and #3 SciFi. Where Action, Comedy, and Drama come in at #6 (Action), #9 (Comedy), and #12 (Drama).

3. The most profitable move was Despicable Me 2 at $894,800,000

Screen Shot 2022-12-08 at 2.51.53 PM.png

Screen Shot 2022-12-08 at 3.06.42 PM.png

Analysis Presentation

Screen Shot 2022-12-16 at 12.59.44 PM.png

R Viz

R Visualizations

Project 1 - Palmer Penguins

Information: This data is from posit cloud and is called "palmerpenguins".

Key insights:

Male penguins are larger on average
Gentoos are the largest penguins
As flipper length goes up, so does body mass
Adelie penguins are more spread out in size than the chinstrap penguins

Visual 1 code:

ggplot(data = penguins) +

geom_point(mapping = aes(x=flipper_length_mm, y = body_mass_g,color=species))+

facet_grid(~sex)

Visual 2 code:

ggplot(data = penguins) +

geom_point(mapping = aes(x=flipper_length_mm, y = body_mass_g,color=species))+

facet_wrap(~species)

Visual 3 code:

ggplot(data = penguins) +

geom_smooth(mapping = aes(x=flipper_length_mm, y = body_mass_g)) +

geom_point(mapping = aes(x=flipper_length_mm, y = body_mass_g,color = species))

Project 2 - Diamonds

Information: This data is from posit cloud and is called "diamonds".

Visual 1 Code:

ggplot(data = diamonds)+

geom_bar(mapping=aes(x=cut,fill=clarity))

Visual 2 code:

ggplot(data = diamonds)+

geom_bar(mapping=aes(x=color,fill=cut))+

facet_wrap(~cut)

BACK TO TOP

Analysis Presentation

R Analysis

R Cleaning and Analysis

Project 1 - Palmer Penguins

Information: This data is from posit cloud and is called "palmerpenguins".

1. Rename the island column new_island

Code Chunk:

penguins %>%

rename(island_new=island)

2. Change all the column names to uppercase

Code Chunk:

rename_with(penguins,toupper)

3. Add a row at the end to calculate kgs for body mass and meters for body length

Code Chunk:

penguins %>%

mutate(body_mass_kg=body_mass_g/1000, flipper_length_m=flipper_length_mm/1000)

4. Arrange the data by bill length smallest to largest

Code Chunk:

penguins %>% arrange(-bill_length_mm)

5. Show the average bill length for each island

Code Chunk:

penguins %>% group_by(island) %>% drop_na() %>% summarize (mean_bill_length_mm = mean (bill_length_mm))

Result:

island mean_bill_length_mm

1 Biscoe 45.2

2 Dream 44.2

3 Torgersen 39.0

6. Show max and min bill length for each island

Code Chunk for max:

penguins %>% group_by(island) %>% drop_na() %>% summarise(max_bill_length_mm = max(bill_length_mm))

Result:

island max_bill_length_mm

1 Biscoe 59.6

2 Dream 58

3 Torgersen 46

Code Chunk for min:

penguins %>% group_by(island) %>% drop_na() %>% summarise(min_bill_length_mm = min(bill_length_mm))

Result

island min_bill_length_mm

1 Biscoe 34.5

2 Dream 32.1

3 Torgersen 33.5

7. Show to average bill length for each species on each island

Code Chunk:

penguins %>% group_by(species, island) %>% drop_na() %>% summarise(max_bl=max(bill_length_mm), mean_bl = mean(bill_length_mm))

Result:

species island max_bl mean_bl

1 Adelie Biscoe 45.6 39.0

2 Adelie Dream 44.1 38.5

3 Adelie Torgersen 46 39.0

4 Chinstrap Dream 58 48.8

5 Gentoo Biscoe 59.6 47.6

Analysis Presentation

BACK TO TOP

Projects

Data Cleaning SQL- BigQuery Project 1

Data Cleaning in SQL- BigQuery Project 2

Data Analysis SQL- BigQuery Project 1 - Getting to know the data with sorting and filtering

Data Cleaning SQL- BigQuery
Project 1

Data Cleaning in SQL- BigQuery
Project 2

Data Analysis SQL- BigQuery
Project 1 - Getting to know the data with sorting and filtering