top of page
Data Clean SQL

Data Cleaning SQL- BigQuery 
Project 1

Information: This is a customer data set with customer names, IDs, addresses, cities, states, zip codes, and countries. This Data set was given to me through the Google Data Analytics Professional Certification Course.  

Steps Taken:

1. Downloaded dataset into SQL BigQuery

 

2. I was asked to add a customer to the dataset and I added a row of data using the INSERT INTO and VALUES functions in this query:

INSERT INTO 

  `my-first-project-sql-363013.customer_data.customer_adress`

  (customer_id, address, city, state, zipcode, country)

VALUES

  (2645,'3456 Highfield Court', 'Jackson', 'MI',49202, "US")

 

3. I was asked to update a current customer address so I updated it using this query

UPDATE

  `my-first-project-sql-363013.customer_data.customer_adress`

SET

  address = '2346 New Haven Ct'

WHERE

  customer_id = 4957

 

4. I wrote this next query to find out if there were any mistakes in the data and found some rows were showing a LENGTH of 3 and not 2  in the country row

SELECT

   LENGTH(country) AS letters_in_country

FROM

   `my-first-project-sql-363013.customer_data.customer_adress`

 

5. I wrote this query to find the mistake where LENGTH was 3 and not 2 

SELECT

   country

FROM

   `my-first-project-sql-363013.customer_data.customer_adress`

WHERE

   LENGTH(country) > 2

 

6. I found that someone imputed ‘USA’ Instead of ‘US’ in this query ​

SELECT

   country

FROM

   `my-first-project-sql-363013.customer_data.customer_adress`

WHERE

   LENGTH(country) > 2

 

7. In order to view the customer_id with ‘US’ and ‘USA’ in the country column I used the SUBSTRING function and the DISTINCT function because there are duplicate customer_id  in this query:

SELECT

    DISTINCT customer_id

FROM

    `customer_data.customer_adress`

WHERE

    SUBSTRING(country, 1, 2) = "US"

 

8. I then write this query to find out if the same mistake has been made in the state column 

SELECT

   state

FROM

   `my-first-project-sql-363013.customer_data.customer_adress`

WHERE

   LENGTH(state) > 2

9. I find out that one OH has extra whitespace behind it and I use the trim function in the query to trim the whitespace​

SELECT 

    DISTINCT customer_id

FROM  

    `customer_data.customer_adress`

WHERE

    TRIM(state)='OH'

 

Information: This is a data set with information on the number of doors, fuel type, horsepower, etc. 

Description of data: https://archive.ics.uci.edu/ml/datasets/Automobile 

 

1. Download data onto BigQuery 

2. Inspected data to make sure it matched the description using this query 

SELECT

    DISTINCT fuel_type

FROM

    Cars.car_info

3. I found it matched the description so I moved on to see if the min and max length are the same as the description using this query 

SELECT

    MIN(length) AS min_length,

    MAX(length) AS max_length

FROM

    Cars.car_info

 

4. I then look for ‘null’ values in the num_of_doors 

SELECT

    *

FROM

    Cars.car_info

WHERE

     num_of_doors IS NULL

 

5.  I find that there are 2 null values and checked with stakeholders and found that all the null values should be say “four” so I write this query to fill in the null

UPDATE

     Cars.car_info

SET

    num_of_doors = "four"

WHERE

    make = "dodge"

    AND fuel_type = "gas"

    AND body_style = "sedan"

 

6. I do further inspection and use this query

 

SELECT

    DISTINCT num_of_cylinders

FROM

     Cars.car_info

 

7. I find out that instead of ‘two’ one of the cells says ‘tow’ so I write this query to change it

UPDATE

     Cars.car_info

SET

    num_of_cylinders = "two"

WHERE

    num_of_cylinders = "tow"

 

8. I now inspect the compression ratio column the see if the range is between 7 and 23 like the data description says so I run this query 

SELECT

    MIN(compression_ratio) AS min_compression_ratio,

    MAX(compression_ratio) AS max_compression_ratio

FROM

    Cars.car_info

9. I find out that the max is 70 and this should be deleted (after I consult the stakeholder and manager) so I write this query to remove it. 

DELETE  

     Cars.car_info

WHERE

     compression_ratio = 70

 

10. I then check the drive_wheels column to check for consistency using this query:

SELECT

    DISTINCT drive_wheels

FROM

    Cars.car_info

11. I find that 4wd shows up twice as a distinct input so I write this query to find out why

SELECT

    DISTINCT drive_wheels,

    LENGTH(drive_wheels) AS string_length

FROM

     Cars.car_info

 

12. I find out that multiple of the 4wd has 4 lengths so I write this query to fix that 

UPDATE

     Cars.car_info

SET

    drive_wheels = TRIM(drive_wheels)

WHERE TRUE

Data Cleaning in SQL- BigQuery 
Project 2

Data Analysis SQL- BigQuery
Project 1 - Getting to know the data with sorting and filtering

Information: This is data from the Google Data Analytics coursework. It is a list of movies, release dates, revenue, etc. 

 

1. Filter the data to only show the comedy Genre that made more than $500,000,000

 

SELECT

    Movie_Title, Genre, Revenue `movie_data.movies`

FROM

    `movie_data.movies`

WHERE

    Genre = 'Comedy'

    AND

    Revenue>500000000

 

2. Sort the data by release date ( DECS) and filter the to only show the comedy genre. 

 

SELECT

    *

FROM

    `movie_data.movies`

WHERE

    Genre = 'Comedy'

ORDER BY

    Release_Date

 DESC

 

3. What was the average revenue in the comedy genre?

 

SELECT

    AVG(Revenue) AS average_comedy_rev

FROM

    `my-first-project-sql-363013.movie_data.movies`

WHERE

    Genre = "Comedy"



 

4. What genre had the highest average revenue?

 

SELECT

    Genre,

    AVG(Revenue) AS average_rev

FROM

    `my-first-project-sql-363013.movie_data.movies`

GROUP BY

    Genre

ORDER BY

    average_rev

 DESC

Here is the result:

 

Movies revenue.png
Analysis SQL

Project 2 - Using JOINs and subqueries 

Information: This data is from a bike share app, from BigQuery, I will show the questions the Stakeholders asked and then what query I wrote to find the answer.

 

1. Compare the average number of bikes available at each station to the average number of bikes available at individual stations. 

 

SELECT

station_id,

num_bikes_available,

   (

     SELECT

         AVG(num_bikes_available)

         FROM

           `bigquery-public-data.new_york.citibike_stations`

    )

     AS avg_num_bikes_available

FROM

    bigquery-public-data.new_york.citibike_stations

 

This is the result:

 

2. Find the start stations with the most number of rides.

 

SELECT

    station_id,

    name,

    num_of_rides AS number_of_rides_at_station

 FROM

    (

        SELECT

             start_station_id,

             COUNT(*) num_of_rides

        FROM

            `bigquery-public-data.new_york.citibike_trips`

        GROUP BY

             start_station_id

    )

    AS station_num_trips

 INNER JOIN

 `bigquery-public-data.new_york.citibike_stations` ON station_id = start_station_id

 ORDER BY

   station_num_trips.num_of_rides DESC

This is the result:

 

 

 

 

 

 

 

 

 

 

 

 

 

 

3. What were the 10 most popular trips (start location to end location)  and what was the average time of those trips? 

 

SELECT

     CONCAT (start_station_name,"to", end_station_name) AS route,

     COUNT (*)  AS num_trips,

     ROUND (AVG (CAST (tripduration AS int64) / 60) ,2) AS duration

FROM

      `bigquery-public-data.new_york.citibike_trips`

GROUP BY

     start_station_name,end_station_name

ORDER BY

     num_trips DESC

LIMIT

     10

This is the result:

AVG bikes compared.png
Start station most rides .png
most popular trips .png
Tableau Viz

Tableau Visualizations

 

Project 1 - Electric Cars in Washington State

Information: This data is from data.gov and it shows the Battery Electric Vehicles (BEVs) and Plugin Hybrid Electric Vehicles (PHEVs) that are currently registered through the Washington State Department of Licensing (DOL). 

Key insights:

1. The Jaguar has the highest Electric Range out of any of the other electric cars with a top range of 212.1 miles

2. The clear winning of the most popular electric car in Washington State is Tesla (47,264), followed by Nissan (12,909), and then Chevrolet (9,846).

3. The highest population of electric cars in Washington State is in Seattle.

4. The most popular Tesla in Washinton State is the Model 3.

Click on the picture to open up Tableau and interact with the Visualization

Project 2 - World Happiness

Information: This data is from the Google Data Analytics coursework. The data looks at elements of happiness and quantifies them. I took the data and made visualizations that compared the elements to find out what really made people happy as well as looked at European happiness as well. 

Key Insights:  

1. The top 4 things that influence a country's happiness are; GDP, family, freedom, and life-expectancy

2. Europe, overall, was a happier nation and had higher correlations between family and happiness, life expectancy and happiness, and GDP and happiness than compared to the rest of the world. 

3. The happiest country according to the data was a 4-way tie with; Norway (7.5), Denmark (7.5), Iceland (7.5), and Switzerland (7.5). 

Click on the picture to open up Tableau and interact with the Visualization.

Project 3- Movies

Information: This data is from the Google Data Analytics coursework. The data includes revenue, budget, movie titles, etc. 

Key Insights:  

1. The 3 most popular Genres of movies made were #1 Action, #2 Comedy, and #3 Drama.

2. The most profitable Genres were #1 Adventure, #2 Family, and #3 SciFi. Where Action, Comedy, and Drama come in at #6 (Action), #9 (Comedy), and #12 (Drama).

3. The most profitable move was Despicable Me 2 at $894,800,000 

Screen Shot 2022-12-08 at 2.51.53 PM.png
Screen Shot 2022-12-08 at 3.06.42 PM.png
Screen Shot 2022-12-16 at 12.59.44 PM.png
R Viz

Visualizations

 

Project 1 - Palmer Penguins

Information: This data is from posit cloud and is called "palmerpenguins". 

Key insights: 

  1. Male penguins are larger on average 

  2. Gentoos are the largest penguins 

  3. As flipper length goes up, so does body mass 

  4. Adelie penguins are more spread out in size than the chinstrap penguins 

Visual 1 code:

ggplot(data = penguins) + 

  geom_point(mapping = aes(x=flipper_length_mm, y = body_mass_g,color=species))+

  facet_grid(~sex)

 

 

 

 

 

 

 

 

 

 

 

Visual 2 code: 

ggplot(data = penguins) + 

  geom_point(mapping = aes(x=flipper_length_mm, y = body_mass_g,color=species))+

  facet_wrap(~species)

 

 

 

 

 

 

 

 

 

 

 

 

 

Visual 3 code:

ggplot(data = penguins) + 

  geom_smooth(mapping = aes(x=flipper_length_mm, y = body_mass_g)) + 

  geom_point(mapping = aes(x=flipper_length_mm, y = body_mass_g,color = species))

Project 2 - Diamonds 

Information: This data is from posit cloud and is called "diamonds". 

Visual 1 Code:

ggplot(data = diamonds)+

  geom_bar(mapping=aes(x=cut,fill=clarity))

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Visual 2 code:

ggplot(data = diamonds)+

  geom_bar(mapping=aes(x=color,fill=cut))+

  facet_wrap(~cut)

 

 

 

Flipper body male female.png
Species penguins.png
Geom smooth color species.png
diaomds .png
lots of charts diamonds.png
R Analysis

Cleaning and Analysis 

 

Project 1 - Palmer Penguins

Information: This data is from posit cloud and is called "palmerpenguins". 

1. Rename the island column new_island 

Code Chunk:

     penguins %>% 

        rename(island_new=island)

 

2. Change all the column names to uppercase

Code Chunk:

     rename_with(penguins,toupper)

 

3. Add a row at the end to calculate kgs for body mass and meters for body length 

Code Chunk:

     penguins %>% 

       mutate(body_mass_kg=body_mass_g/1000, flipper_length_m=flipper_length_mm/1000)

4. Arrange the data by bill length smallest to largest 

 

Code Chunk:

   penguins %>%  arrange(-bill_length_mm)

5. Show the average bill length for each island

 

Code Chunk:

     penguins %>%  group_by(island) %>%  drop_na() %>%  summarize (mean_bill_length_mm = mean (bill_length_mm))

Result:

     island    mean_bill_length_mm

       <fct>                   <dbl>

     1 Biscoe                   45.2

     2 Dream                   44.2

     3 Torgersen              39.0

6. Show max and min bill length for each island 

Code Chunk for max:

     penguins %>% group_by(island) %>%  drop_na() %>%  summarise(max_bill_length_mm = max(bill_length_mm))

Result:

       island    max_bill_length_mm

       <fct>                  <dbl>

     1 Biscoe                  59.6

     2 Dream                  58  

     3 Torgersen             46 

Code Chunk for min:

penguins %>% group_by(island) %>%  drop_na() %>%  summarise(min_bill_length_mm = min(bill_length_mm))

Result

       island    min_bill_length_mm

       <fct>                  <dbl>

     1 Biscoe                  34.5

     2 Dream                  32.1

     3 Torgersen             33.5

7. Show to average bill length for each species on each island

 

Code Chunk:

     penguins %>%  group_by(species, island) %>%  drop_na() %>%  summarise(max_bl=max(bill_length_mm), mean_bl =             mean(bill_length_mm))

Result:

     species   island    max_bl mean_bl

       <fct>     <fct>      <dbl>   <dbl>

     1 Adelie    Biscoe      45.6     39.0

     2 Adelie    Dream       44.1    38.5

     3 Adelie    Torgersen  46       39.0

     4 Chinstrap Dream     58       48.8

     5 Gentoo    Biscoe     59.6     47.6

bottom of page