Scraping JavaScript-heavy websites

Unlike BritHoops, some websites make extensive use of JavaScript for additional functionality, which makes it difficult to extract the data we want using the “requests” library.

In this chapter, our aim is to extract the results of the games played so far in the Men’s SLB, which can be found on the livestats page. Once we have the data we want, we will process it and produce a table of the results from the Trophy competition.

Note: The actual SLB Trophy competition was divided into North and South pools, so a combined table isn't reflective of the actual competition. However, one of the beauties of scraping and processing data is that we can produce some fictional "what-ifs" to explore.

Here’s a cropped screenshot of the livestats page so that you can see what it looks like:

Screenshot of the livestats page

As you can see, this data is almost certainly found in a table somewhere within the HTML.

Accessing the correct HTML data

As a first attempt we can try scraping the HTML from livestats with requests and Beautiful Soup, in much the same way as we did for our First Foray into Data Scraping.

import requests
from bs4 import BeautifulSoup

Due to the way that the livestats page serves up its HTML, we need to pass an extra option to requests too: verify=False, which tells it to skip SSL certificate verification. This is a minor security risk, so it should be avoided where possible, and you are likely to receive a Python warning when running it.

livestats_url = 'https://www.superleaguebasketballm.co.uk/livestats/'
page = requests.get(livestats_url, verify=False)
soup = BeautifulSoup(page.content, 'html.parser')

/Users/ch1jgh/miniforge3/envs/bball-analytics/lib/python3.12/site-packages/urllib3/connectionpool.py:1099: InsecureRequestWarning: Unverified HTTPS request is being made to host 'www.superleaguebasketballm.co.uk'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#tls-warnings
  warnings.warn(

If you save the HTML soup to a file and inspect it, you’ll see it is much more complex than what we saw with BritHoops. We can persist with this and try to extract any HTML tables from the soup.

tables = soup.find_all('table')
len(tables)

Hmm, it appears that there are no HTML tables available in the soup we scraped. One of the reasons for this is that the webpage uses iframes, which makes grabbing the data more difficult, as each iframe is effectively a webpage within a webpage.
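A quick way to check the iframe theory is to list the iframe tags in the scraped HTML. The sketch below runs on a small hand-written snippet rather than the real livestats page, so the `src` values are placeholders and not the site's actual URLs.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a page that embeds its content in iframes.
# The src values are made up for illustration.
sample_html = """
<html><body>
  <h1>Live stats</h1>
  <iframe src="https://example.com/fixtures"></iframe>
  <iframe src="https://example.com/results"></iframe>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# The outer page has no tables of its own...
n_tables = len(sample_soup.find_all("table"))

# ...but it points at other documents through its iframes.
iframe_sources = [frame.get("src") for frame in sample_soup.find_all("iframe")]

print(n_tables)
print(iframe_sources)
```

Running the same two `find_all` calls on the real `soup` should show how many iframes the livestats page embeds. Since requests only downloads the outer document, the content of those iframes never reaches Beautiful Soup.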

We are going to change our approach here and automate the use of a web browser to extract the information we want. The selenium Python package allows us to do this with a variety of browsers; I am going to use Google Chrome, along with the Chrome for Testing program.

The next code cell imports what we need from selenium, asks it to use the Chrome browser, opens the livestats URL (a Chrome window should pop up if you run this code), switches to the second iframe on the webpage, scrapes the HTML from said iframe, then closes the browser window.

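from selenium import webdriver

driver = webdriver.Chrome()
# driver.get() returns None, so there is no result to keep
driver.get(livestats_url)
# Switch to the second iframe on the page (frames are indexed from 0)
driver.switch_to.frame(1)
results_soup = BeautifulSoup(driver.page_source, 'html.parser')
# quit() closes the browser window and ends the driver session
driver.quit()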

Let’s try extracting any HTML tables again, and see how many we get.

tables = results_soup.find_all('table')
len(tables)

Excellent! We’ve managed to extract a table this time. To check it contains the information we want, we can extract the first row from the table and see what it looks like.

tables[0].find('tr')

<tr class="sp_lsmatch finished spmatcheven"><td class="sp_dtfield">                                                    27/09/2024                                                                            </td><td class="sp_tmfield">7:30 pm</td><td>Trophy 2024-25</td><td class="spteam1 tid178239"><span class="team1name">Caledonia Gladiators</span><span class="logo"><img src="https://images.statsengine.playbyplay.api.geniussports.com/706553f85ba846cfbcae48cf69f1c316T1.png"/></span></td><td class="sp_status">Final</td><td class="spteam2 tid178234"><span class="logo"><img src="https://images.statsengine.playbyplay.api.geniussports.com/516edf4c0fb58d77097883a0cf4966c3T1.png"/></span><span class="team1name">Manchester Basketball</span></td><td class="spscore"><a href="http://www.fibalivestats.com/webcast/SLB/2523184/" style="text-decoration:none;" target="_blank">99 - 77</a></td></tr>

This definitely contains what we want - it’s the result of the Gladiators vs Manchester game from 27th Sept 2024 (Glads won 99-77).
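To see how the cells of such a row map onto named fields, here is a minimal sketch that parses an abbreviated copy of the row above (logos and link styling removed). The field names in `field_names` are my own labels, not something the page provides, and treating the first team listed as the home team is an assumption.

```python
from bs4 import BeautifulSoup

# Abbreviated copy of the row shown above; the padding around the date is kept on purpose.
row_html = (
    '<tr class="sp_lsmatch finished spmatcheven">'
    '<td class="sp_dtfield">      27/09/2024      </td>'
    '<td class="sp_tmfield">7:30 pm</td>'
    '<td>Trophy 2024-25</td>'
    '<td class="spteam1 tid178239"><span class="team1name">Caledonia Gladiators</span></td>'
    '<td class="sp_status">Final</td>'
    '<td class="spteam2 tid178234"><span class="team1name">Manchester Basketball</span></td>'
    '<td class="spscore"><a href="http://www.fibalivestats.com/webcast/SLB/2523184/">99 - 77</a></td>'
    '</tr>'
)

row = BeautifulSoup(row_html, "html.parser").find("tr")

# strip=True trims the whitespace padding around values such as the date.
cells = [cell.get_text(strip=True) for cell in row.find_all("td")]

# Hypothetical labels for the seven cells, in the order they appear.
field_names = ["date", "time", "competition", "home", "status", "away", "score"]
game = dict(zip(field_names, cells))

print(game)
```

The positions used here (competition in cell 2, teams in cells 3 and 5, score in cell 6) are the ones we will rely on when looping over every row.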

Extracting the information we want from the HTML

The information we want is held in the table rows, but to build a standings table we first need somewhere to accumulate it. We can create some Python dictionaries to hold each team’s number of wins and losses, and their points for and against.

teams = ["B. Braun Sheffield Sharks", "Bristol Flyers", "Caledonia Gladiators", "Cheshire Phoenix",
         "Leicester Riders", "London Lions", "Manchester Basketball", "Newcastle Eagles", "Surrey 89ers"]

# One tally per statistic, each starting at zero for every team
wins = dict.fromkeys(teams, 0)
losses = dict.fromkeys(teams, 0)
points_for = dict.fromkeys(teams, 0)
points_allowed = dict.fromkeys(teams, 0)

Next we will extract all of the table rows, and define a string that we will use to filter only the Trophy results from the full set of games.

results_rows = tables[0].find_all('tr')
trophy_string = "Trophy 2024-25"

Now it is time to extract and process the results into the dictionaries. To do that, we will loop over all of the table rows, check whether each one was a Trophy game, and grab the information we want.

for entry in results_rows:
    row_cells = entry.find_all('td')
    if row_cells[2].get_text() == trophy_string:
        home_team = row_cells[3].get_text()
        away_team = row_cells[5].get_text()
        score = row_cells[6].get_text()
        if score != "Upcoming":
            # The score reads "home - away", e.g. "99 - 77"
            home_for, away_for = [int(points) for points in score.split(" - ")]
            # Each team's points allowed are the other team's points scored
            home_allowed, away_allowed = away_for, home_for
            points_for[home_team] += home_for
            points_for[away_team] += away_for
            points_allowed[home_team] += home_allowed
            points_allowed[away_team] += away_allowed
            if home_for > away_for:
                # Home team won
                wins[home_team] += 1
                losses[away_team] += 1
            else:
                # Away team won
                wins[away_team] += 1
                losses[home_team] += 1
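The score handling inside the loop can also be pulled out into a small helper, which makes it easy to test on its own. The sketch below is one possible version; `parse_score` is a hypothetical name, and it assumes the score cell contains either text of the form "99 - 77" or a placeholder such as "Upcoming".

```python
def parse_score(score_text):
    """Return (home_points, away_points), or None if there is no final score yet."""
    parts = score_text.strip().split(" - ")
    # Anything that is not exactly two whole numbers is treated as "no result".
    if len(parts) != 2 or not all(part.strip().isdigit() for part in parts):
        return None
    return int(parts[0]), int(parts[1])


# The Gladiators vs Manchester game from the first row of the table.
home_points, away_points = parse_score("99 - 77")
home_won = home_points > away_points

print(home_points, away_points, home_won)
print(parse_score("Upcoming"))
```

Returning `None` for anything that does not look like a score means a game that has not been played is skipped rather than causing an error when converting to `int`.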

Let’s have a quick peek inside the “wins” dictionary to check if we have something sensible.

wins

{'B. Braun Sheffield Sharks': 4,
 'Bristol Flyers': 3,
 'Caledonia Gladiators': 2,
 'Cheshire Phoenix': 6,
 'Leicester Riders': 5,
 'London Lions': 2,
 'Manchester Basketball': 0,
 'Newcastle Eagles': 5,
 'Surrey 89ers': 1}

Now that we’ve got the data, it’s time to place it into lists and create a dataframe. Let’s start by importing pandas.

import pandas as pd

Then we create the lists and the dataframe.

teams = list(wins.keys())
standings_df = pd.DataFrame()
standings_df["Team"] = teams

wins_list = list(wins.values())
losses_list = list(losses.values())
points_for_list = list(points_for.values())
points_allowed_list = list(points_allowed.values())

standings_df["W"] = wins_list
standings_df["L"] = losses_list
standings_df["PPG"] = points_for_list
standings_df["OPPG"] = points_allowed_list
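As an alternative sketch, pandas can build a dataframe like this in one step from a dictionary of dictionaries, matching the values up by team name rather than by list position. The `sample_*` dictionaries and `alt_df` below are hypothetical names, holding three teams' tallies from this chapter's results to keep the example short.

```python
import pandas as pd

# Tallies for three teams only, taken from this chapter's results.
sample_wins = {"Cheshire Phoenix": 6, "Leicester Riders": 5, "Newcastle Eagles": 5}
sample_losses = {"Cheshire Phoenix": 1, "Leicester Riders": 0, "Newcastle Eagles": 2}
sample_points_for = {"Cheshire Phoenix": 664, "Leicester Riders": 461, "Newcastle Eagles": 557}
sample_points_allowed = {"Cheshire Phoenix": 570, "Leicester Riders": 387, "Newcastle Eagles": 552}

# Each inner dictionary becomes a column; its keys (team names) become the index.
alt_df = pd.DataFrame({
    "W": sample_wins,
    "L": sample_losses,
    "PPG": sample_points_for,
    "OPPG": sample_points_allowed,
})

# Move the team names out of the index and into an ordinary "Team" column.
alt_df = alt_df.rename_axis("Team").reset_index()

print(alt_df)
```

Because the values are aligned on the team name, this form does not depend on the four dictionaries listing the teams in the same order.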

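We can take a look at the information that’s gone into the dataframe. Note that, at this stage, the PPG and OPPG columns hold each team’s total points scored and allowed, rather than per-game averages.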

standings_df

	Team	W	L	PPG	OPPG
0	B. Braun Sheffield Sharks	4	3	603	578
1	Bristol Flyers	3	3	521	505
2	Caledonia Gladiators	2	5	573	601
3	Cheshire Phoenix	6	1	664	570
4	Leicester Riders	5	0	461	387
5	London Lions	2	3	393	418
6	Manchester Basketball	0	6	468	564
7	Newcastle Eagles	5	2	557	552
8	Surrey 89ers	1	5	455	520

Create a table based on the results

Use the information from the dataframe to calculate some new columns: the number of games played (GP), the difference between points scored and points allowed (Diff), a league points total based on 2 points for a win (Pts, fictional for the Trophy competition), and the fraction of games won (Win%).

standings_df["GP"] = standings_df["W"] + standings_df["L"]
standings_df["Diff"] = standings_df["PPG"] - standings_df["OPPG"]
standings_df["Pts"] = standings_df["W"] * 2
standings_df["Win%"] = standings_df["W"] / standings_df["GP"]
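As a worked example, here are the same four calculations done by hand for a single row: Cheshire Phoenix, with 6 wins, 1 loss, 664 points scored and 570 allowed in the dataframe above. The variable names are just for illustration.

```python
# Cheshire Phoenix's row from the dataframe above.
won, lost = 6, 1
points_scored, points_conceded = 664, 570

games_played = won + lost                    # GP
difference = points_scored - points_conceded # Diff
league_points = won * 2                      # Pts (2 per win, fictional for the Trophy)
win_fraction = won / games_played            # Win%

print(games_played, difference, league_points, round(win_fraction, 6))
# 7 94 12 0.857143
```

The pandas version does exactly this arithmetic, but for every row at once.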

The NBA often uses a “games back” (GB) column in its standings tables, so we can define a series of functions to calculate it.

def w_games_behind(wins, wins_ref):
    w_diff = wins_ref - wins
    return w_diff/2

def l_games_behind(losses, losses_ref):
    l_diff = losses - losses_ref
    return l_diff/2

def games_back(w_games, l_games):
    g_back = w_games + l_games
    return g_back

We can then calculate how many “games back” each team is, after working out which team has the best Win%.

best_win_per = standings_df['Win%'].idxmax()
# idxmax returns an index label, so look the row up with .loc
wins_ref = standings_df.loc[best_win_per, 'W']
losses_ref = standings_df.loc[best_win_per, 'L']
standings_df['wgb'] = standings_df.apply(lambda x: w_games_behind(x['W'], wins_ref), axis=1)
standings_df['lgb'] = standings_df.apply(lambda x: l_games_behind(x['L'], losses_ref), axis=1)
standings_df['GB'] = standings_df.apply(lambda x: games_back(x['wgb'], x['lgb']), axis=1)
# Replace 0.0 games back (the top team, and any team level with it) with --
standings_df['GB'] = standings_df['GB'].replace(0.0, '--')

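Calling the dataframe will show us what data we have so far. Note that Cheshire Phoenix also show “--”: they have one more win and one more loss than Leicester Riders, so they work out at zero games back.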

standings_df

	Team	W	L	PPG	OPPG	GP	Diff	Pts	Win%	wgb	lgb	GB
0	B. Braun Sheffield Sharks	4	3	603	578	7	25	8	0.571429	0.5	1.5	2.0
1	Bristol Flyers	3	3	521	505	6	16	6	0.500000	1.0	1.5	2.5
2	Caledonia Gladiators	2	5	573	601	7	-28	4	0.285714	1.5	2.5	4.0
3	Cheshire Phoenix	6	1	664	570	7	94	12	0.857143	-0.5	0.5	--
4	Leicester Riders	5	0	461	387	5	74	10	1.000000	0.0	0.0	--
5	London Lions	2	3	393	418	5	-25	4	0.400000	1.5	1.5	3.0
6	Manchester Basketball	0	6	468	564	6	-96	0	0.000000	2.5	3.0	5.5
7	Newcastle Eagles	5	2	557	552	7	5	10	0.714286	0.0	1.0	1.0
8	Surrey 89ers	1	5	455	520	6	-65	2	0.166667	2.0	2.5	4.5
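The GB values in the table above can be checked by hand, because the three helper functions reduce to a single expression: half the difference in wins plus half the difference in losses, both measured against the reference team. The sketch below uses a hypothetical `games_back_direct` function and three records copied from the table.

```python
def games_back_direct(won, lost, wins_ref, losses_ref):
    """Games back in one step: ((wins_ref - won) + (lost - losses_ref)) / 2."""
    return ((wins_ref - won) + (lost - losses_ref)) / 2


# Reference team: Leicester Riders (5 wins, 0 losses), the best Win% in the table.
ref_wins, ref_losses = 5, 0

# (wins, losses) for three teams, copied from the table above.
records = {
    "Newcastle Eagles": (5, 2),
    "Cheshire Phoenix": (6, 1),
    "Manchester Basketball": (0, 6),
}

checked = {
    team: games_back_direct(won, lost, ref_wins, ref_losses)
    for team, (won, lost) in records.items()
}

print(checked)
# {'Newcastle Eagles': 1.0, 'Cheshire Phoenix': 0.0, 'Manchester Basketball': 5.5}
```

These agree with the wgb + lgb totals in the dataframe, before the 0.0 values are swapped for “--”.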

It’s now time to turn this into a nice-looking “league” table. We will define some functions to add colour and bold font based on whether a value is greater than or less than a threshold.

# functions to change font colour based on a threshold
def colour_threshold_lessthan(value, threshold, colour='red'):
    if value < threshold:
        return f'color: {colour}'
    else:
        return ''

def colour_threshold_morethan(value, threshold, colour='green'):
    if value > threshold:
        return f'color: {colour}'
    else:
        return ''

# functions to change font weight based on a threshold
def weight_threshold_lessthan(value, threshold):
    if value < threshold:
        return 'font-weight: bold'
    else:
        return ''

def weight_threshold_morethan(value, threshold):
    if value > threshold:
        return 'font-weight: bold'
    else:
        return ''
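Since a styling function may return several CSS properties in one string, separated by semicolons, the four functions could also be collapsed into one. The sketch below is an alternative, not the approach used in the rest of this chapter, and `style_diff` is a hypothetical name.

```python
def style_diff(value, threshold=0, low_colour="red", high_colour="green"):
    """Return CSS for a value: coloured and bold either side of the threshold, unstyled on it."""
    if value < threshold:
        return f"color: {low_colour}; font-weight: bold"
    if value > threshold:
        return f"color: {high_colour}; font-weight: bold"
    return ""


# Two Diff values from the standings, plus the threshold itself.
print(style_diff(-28))
print(style_diff(94))
print(repr(style_diff(0)))
```

A function like this would be passed to the styler once, for example `.map(style_diff, subset=['Diff'])`, instead of chaining four separate calls.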

We will sort the table so that the team with the highest Win% goes first, then lowest number of games back, then use Diff to sort any teams that would be tied.

We will only print a subset of the stats that are contained in the dataframe, and apply some formatting.

standings_df = standings_df.sort_values(by=["Win%", "GB", "Diff"], ascending=[False, True, False])
standings_df = standings_df[["Team", "GP", "W", "L", "PPG", "OPPG", "Diff", "Win%", "GB"]]
standings_df.style.format({'Win%':'{:.2}'}).set_caption('SLB fictitious Trophy standings').set_table_styles([{
    'selector': 'caption',
    'props': [
        ('text-align', 'left'),
        ('font-size', '18px'),
        ('font-weight', 'bold')
    ]
}]).map(colour_threshold_lessthan,threshold = 0,subset = ['Diff'])\
.map(weight_threshold_lessthan,threshold = 0,subset = ['Diff'])\
.map(colour_threshold_morethan,threshold = 0,subset = ['Diff'])\
.map(weight_threshold_morethan,threshold = 0,subset = ['Diff'])\
.hide(axis="index")

SLB fictitious Trophy standings
Team	GP	W	L	PPG	OPPG	Diff	Win%	GB
Leicester Riders	5	5	0	461	387	74	1.0	--
Cheshire Phoenix	7	6	1	664	570	94	0.86	--
Newcastle Eagles	7	5	2	557	552	5	0.71	1.000000
B. Braun Sheffield Sharks	7	4	3	603	578	25	0.57	2.000000
Bristol Flyers	6	3	3	521	505	16	0.5	2.500000
London Lions	5	2	3	393	418	-25	0.4	3.000000
Caledonia Gladiators	7	2	5	573	601	-28	0.29	4.000000
Surrey 89ers	6	1	5	455	520	-65	0.17	4.500000
Manchester Basketball	6	0	6	468	564	-96	0.0	5.500000

In summary, in this chapter we have used automated control of a web browser to extract HTML from a complicated website, processed the data into a dataframe, then produced a league standings table using some stats that we calculated based on game results.

Exercises

To extend this, you could:

Modify the code to extract the data on a different set of games, for example, the SLB Championship.
Produce a different league table, which shows other sets of stats and is based on achieving two points for each game won.