SLB boxscore data scrape#
In this chapter our aim is to produce comma-separated values (CSV) files containing the box scores for games that have been played in Women’s Super League Basketball. We will do this in two main steps:
Scrape the URLs we need to access the web page for each game’s box score.
Scrape the box score data for each game and save it as a CSV.
As we are accessing data from JavaScript-heavy websites, we will use the selenium package to automate control of a web browser. You can find more information on this in the chapter on scraping JavaScript-heavy websites. If you have not used selenium with Chrome before, you will probably need to download and install Chrome for Testing.
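If you want to confirm that your setup works before going any further, a minimal smoke test is sketched below: it opens a Chrome window, reports the browser version selenium is driving, and closes the window again (this assumes selenium and Chrome/Chrome for Testing are already installed).
from selenium import webdriver

# Open a Chrome window, report the version selenium is driving, then quit.
driver = webdriver.Chrome()
print(driver.capabilities["browserVersion"])
driver.quit()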
We will need to run the same scraping process for the box score of each game, so we will define Python functions that we can call repeatedly.
Scraping the URL data we need#
The Women’s Super League Basketball website has a livestats page that contains all of the game final scores (and current scores for any games in progress). There is also a link to more information for each game, and it is those links that we are trying to obtain.
As usual, the first thing we will do is import the Python libraries and packages that we will need.
from datetime import datetime
from pathlib import Path
import requests
import time
import pandas as pd
import numpy as np
import re
from bs4 import BeautifulSoup
from selenium import webdriver
The next code cell opens the livestats URL using the Chrome web browser (a new window should open if you run the code), places the necessary HTML into “soup”, and then automatically closes the browser window.
livestats_url = "https://www.superleaguebasketballw.co.uk/livestats/"
driver = webdriver.Chrome()
driver.get(livestats_url)
driver.switch_to.frame(1)
results_soup = BeautifulSoup(driver.page_source, 'html')
driver.close()
All of the pertinent information is held in an HTML table (if necessary, you can download and inspect the website HTML to verify this), and games which have not yet been played have the text “Upcoming” rather than a score. The following piece of code extracts the game URLs as a list that we can process further.
tables = results_soup.find_all('table')
#comp_string = 'Championship 2024-25'
upcoming_string = 'Upcoming'
url_soup = []
for rows in tables[0].find_all('tr'):
    cells = rows.find_all('td')
    # if comp_string in cells[2]:
    for a in rows.find_all('a', href=True):
        if a.contents[0] != upcoming_string:
            url_soup.append(a['href'])
Let’s check how many game URLs we have extracted.
len(url_soup)
17
Rather than taking all 17 games, we can take just the first 10 by taking a slice of the list. This reduces the time taken to run all of the code; if you want information on more games, the slice can easily be modified. We can also take a peek at what the list of URLs looks like: the links gathered from the SLB site point directly at the FIBA live stats page for each game.
url_soup = url_soup[:10]
url_soup
['http://www.fibalivestats.com/webcast/wbbl/2522426/',
'http://www.fibalivestats.com/webcast/wbbl/2522423/',
'http://www.fibalivestats.com/webcast/wbbl/2522421/',
'http://www.fibalivestats.com/webcast/wbbl/2522427/',
'http://www.fibalivestats.com/webcast/wbbl/2522424/',
'http://www.fibalivestats.com/webcast/wbbl/2522419/',
'http://www.fibalivestats.com/webcast/wbbl/2522422/',
'http://www.fibalivestats.com/webcast/wbbl/2522417/',
'http://www.fibalivestats.com/webcast/wbbl/2538198/',
'http://www.fibalivestats.com/webcast/wbbl/2522428/']
We can now assemble the URL required to access the box scores and check that the web page exists (more technically, that the URL resolves).
game_id = []
for url in url_soup:
    game_id.append(url.split('/')[5])
league = 'WBBL'
baseurl = 'https://www.fibalivestats.com/u/{}'.format(league)
games = []
for g_id in game_id:
    url = "{}/{}/".format(baseurl, g_id)
    resp = requests.get(url)
    if resp.status_code == 200:
        #print(url)
        games.append(url)
    else:
        print("Couldn't resolve URL:", url)
The URLs we will access can be inspected.
games
['https://www.fibalivestats.com/u/WBBL/2522426/',
'https://www.fibalivestats.com/u/WBBL/2522423/',
'https://www.fibalivestats.com/u/WBBL/2522421/',
'https://www.fibalivestats.com/u/WBBL/2522427/',
'https://www.fibalivestats.com/u/WBBL/2522424/',
'https://www.fibalivestats.com/u/WBBL/2522419/',
'https://www.fibalivestats.com/u/WBBL/2522422/',
'https://www.fibalivestats.com/u/WBBL/2522417/',
'https://www.fibalivestats.com/u/WBBL/2538198/',
'https://www.fibalivestats.com/u/WBBL/2522428/']
Exercise#
After inspecting the livestats page in a web browser to help you decide what to do, modify the code so that you only obtain the URLs for the first four games played in the 2024-25 version of the Betty Codona Trophy.
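Hint: the commented-out comp_string lines in the code above suggest one approach. A sketch is given below; the exact competition label and the cell it appears in are assumptions, so inspect the page HTML to confirm them before relying on this.
# A sketch of one possible approach (labels unverified): filter the table
# rows by competition name before collecting the links.
comp_string = 'Betty Codona Trophy 2024-25'  # placeholder label, check the page
upcoming_string = 'Upcoming'
trophy_urls = []
for rows in tables[0].find_all('tr'):
    cells = rows.find_all('td')
    # Assumes the competition name sits in the third cell of each row
    if len(cells) > 2 and comp_string in cells[2].get_text():
        for a in rows.find_all('a', href=True):
            if a.contents[0] != upcoming_string:
                trophy_urls.append(a['href'])
trophy_urls = trophy_urls[:4]  # assumes rows are listed in date order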
Scraping the box score data#
Now that we have the URLs, we need to scrape the box score for each one of them. As we are essentially repeating the same exercise multiple times, we will define some functions that we can call for each URL.
The first function takes a soup object as input, creates some internal lists of data, then returns the data as a dataframe. To make creating filenames for CSV files easier, the same function also returns the names of the teams that played and the date the game was played.
As we are just defining the function, running the next cell of code doesn’t actually do anything at this point. Note that this function should work for any FIBA livestats page, not just Women’s SLB, assuming that the page contains all of the stats we are looking for.
def stats_to_df(soup):
    """Converts the soup of FIBA livestats for a single game into a data frame.
    The data frame, the teams playing and the date the game was played are then returned"""
    teams = []
    team_divs = soup.find_all("div", {"class": "team-name"})
    for count, div in enumerate(team_divs):
        team_span = div.find_all('span')
        teams.append(team_span[0].get_text())
    date = soup.find_all("div", {"class": "og-date"})[0].get_text()
    date_formatted = datetime.strptime(date, '%d/%m/%Y')
    date = date_formatted.strftime('%Y%m%d')
    # Create the internal lists to hold all of the data
    player_name = []
    team = []
    minutes = []
    points = []
    fgm = []
    fga = []
    fgper = []
    twopm = []
    twopa = []
    twoper = []
    threepm = []
    threepa = []
    threeper = []
    ftpm = []
    ftpa = []
    ftper = []
    rebo = []
    rebd = []
    rebtot = []
    assists = []
    tos = []
    steals = []
    blocks = []
    blocksr = []
    fouls = []
    foulson = []
    plusminus = []
    # Populate the lists
    scores_tables = soup.find_all("table", {"class": "boxscore"})
    for team_count, table in enumerate(scores_tables):
        for count, row in enumerate(table.find_all('tr', {"class": "player-row"})):
            if count != 0:
                player_name.append(row.find_all('a', {"class": "playerpopup"})[0].find_all('span')[0].get_text())
                team.append(teams[team_count])
                minutes.append(row.find_all('span', {"id": re.compile("Minutes")})[0].get_text())
                points.append(row.find_all('span', {"id": re.compile("Points")})[0].get_text())
                fgm.append(row.find_all('span', {"id": re.compile("FieldGoalsMade")})[0].get_text())
                fga.append(row.find_all('span', {"id": re.compile("FieldGoalsAttempted")})[0].get_text())
                fgper.append(row.find_all('span', {"id": re.compile("FieldGoalsPercentage")})[0].get_text())
                twopm.append(row.find_all('span', {"id": re.compile("TwoPointersMade")})[0].get_text())
                twopa.append(row.find_all('span', {"id": re.compile("TwoPointersAttempted")})[0].get_text())
                twoper.append(row.find_all('span', {"id": re.compile("TwoPointersPercentage")})[0].get_text())
                threepm.append(row.find_all('span', {"id": re.compile("ThreePointersMade")})[0].get_text())
                threepa.append(row.find_all('span', {"id": re.compile("ThreePointersAttempted")})[0].get_text())
                threeper.append(row.find_all('span', {"id": re.compile("ThreePointersPercentage")})[0].get_text())
                ftpm.append(row.find_all('span', {"id": re.compile("FreeThrowsMade")})[0].get_text())
                ftpa.append(row.find_all('span', {"id": re.compile("FreeThrowsAttempted")})[0].get_text())
                ftper.append(row.find_all('span', {"id": re.compile("FreeThrowsPercentage")})[0].get_text())
                rebo.append(row.find_all('span', {"id": re.compile("ReboundsOffensive")})[0].get_text())
                rebd.append(row.find_all('span', {"id": re.compile("ReboundsDefensive")})[0].get_text())
                rebtot.append(row.find_all('span', {"id": re.compile("ReboundsTotal")})[0].get_text())
                assists.append(row.find_all('span', {"id": re.compile("Assists")})[0].get_text())
                tos.append(row.find_all('span', {"id": re.compile("Turnovers")})[0].get_text())
                steals.append(row.find_all('span', {"id": re.compile("Steals")})[0].get_text())
                blocks.append(row.find_all('span', {"id": re.compile("Blocks")})[0].get_text())
                blocksr.append(row.find_all('span', {"id": re.compile("BlocksReceived")})[0].get_text())
                fouls.append(row.find_all('span', {"id": re.compile("FoulsPersonal")})[0].get_text())
                foulson.append(row.find_all('span', {"id": re.compile("FoulsOn")})[0].get_text())
                plusminus.append(row.find_all('span', {"id": re.compile("PlusMinusPoints")})[0].get_text())
    # Create the dataframe
    df = pd.DataFrame(np.column_stack([player_name, team, minutes, points, fgm, fga, fgper,
                                       twopm, twopa, twoper, threepm, threepa, threeper,
                                       ftpm, ftpa, ftper, rebo, rebd, rebtot, assists, tos,
                                       steals, blocks, blocksr, fouls, foulson, plusminus]),
                      columns=["Name", "Team", "Mins", "PTS", "FGM", "FGA", "FG%",
                               "2PM", "2PA", "2P%", "3PM", "3PA", "3P%",
                               "FTM", "FTA", "FT%", "OREB", "DREB", "REB", "AST", "TO",
                               "STL", "BLK", "BLKR", "PF", "FOULON", "PLUSMINUS"])
    return df, teams, date
This next function takes a game URL from above, adds the extra part of the URL required to access the box score, then uses Selenium and Chrome to scrape the page into BeautifulSoup.
def fiba_url_to_soup(game):
    """Takes the base FIBA livestats URL, adds the extra info to request the boxscore, then returns the page soup"""
    url = game + 'bs.html'
    browser = webdriver.Chrome()
    browser.get(url)
    time.sleep(2)  # give the JavaScript time to populate the page
    soup = BeautifulSoup(browser.page_source, 'html')
    browser.close()
    return soup
We also need a function to export the dataframe as a CSV file.
The filename for the CSV is automatically assembled from the team names and date data we extracted earlier. We can optionally also ask for the CSV files to be saved into a specific directory/folder.
def save_game_csv(df, teams, date, directory=None):
    """Saves the dataframe in CSV format, with the filename generated from the teams and date
    Optionally places the file into a directory"""
    filename = teams[0].replace(" ", "-") + "-Vs-" + teams[1].replace(" ", "-") + "-" + date + ".csv"
    if directory is None:
        df.to_csv(filename)
    else:
        if not Path(directory).is_dir():
            Path(directory).mkdir()
        df.to_csv(Path(directory, filename))
    return
With the functions defined, we now need to run all three of them for each of the games we identified above. Please note that it will launch a Chrome window (and eventually close it) for every game, so it can take a while if you’re asking for a lot of games to be scraped.
The CSV files should be saved in a directory called “data”.
for game in games:
    soup = fiba_url_to_soup(game)
    df, teams, date = stats_to_df(soup)
    save_game_csv(df, teams, date, "data")
We should now have a separate CSV for each game that has been played, all in the directory called data. These CSV files can be opened in most spreadsheet software, or we can read them into pandas dataframes for further processing.
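As a quick check, here is a minimal sketch that reads the saved files back in and stacks them into a single dataframe (this assumes the CSVs were written to the data directory by the loop above).
# Read every saved box score back in and combine them into one dataframe.
csv_files = sorted(Path("data").glob("*.csv"))
all_games = pd.concat((pd.read_csv(f, index_col=0) for f in csv_files), ignore_index=True)
all_games.head()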
Exercise#
Modify the code so that it downloads the boxscores for games involving just one team from the Men’s SLB.
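Hint: one possible shape for the final loop is sketched below. It only shows the filtering step, using the functions already defined; "TEAM_NAME" is a placeholder, and you will also need to rebuild the games list from the Men’s SLB livestats page and its FIBA league code, which are not shown here.
# Sketch of the filtering step only (not a full solution): scrape each
# game as before, but save the CSV only when the chosen team took part.
target_team = "TEAM_NAME"  # placeholder for the Men's SLB team of interest
for game in games:
    soup = fiba_url_to_soup(game)
    df, teams, date = stats_to_df(soup)
    if target_team in teams:
        save_game_csv(df, teams, date, "data")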