Table Web Scraping Issues With Python
Solution 1:
You can use webdriver, pandas and BeautifulSoup to get all the table data.
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd

url = "https://fantasy.premierleague.com/player-list"

# Selenium loads the page and executes its JavaScript
driver = webdriver.Firefox()
driver.get(url)
html = driver.page_source
driver.quit()

# parse the rendered HTML and pull out the player table(s)
soup = BeautifulSoup(html, 'html.parser')
table = soup.find_all('table', {'class': 'Table-ziussd-1 fVnGhl'})
df = pd.read_html(str(table))
print(df)
Output will be:
[         Player            Team  Points  Cost
0        Alisson       Liverpool      99  £6.2
1        Ederson        Man City      89  £6.0
2           Kepa         Chelsea      72  £5.4
3     Schmeichel       Leicester     122  £5.4
4         de Gea         Man Utd     105  £5.3
5         Lloris           Spurs      56  £5.3
6      Henderson   Sheffield Utd     135  £5.3
7       Pickford         Everton      93  £5.2
8       Patrício          Wolves     122  £5.2
9       Dubravka       Newcastle     124  £5.1
10          Leno         Arsenal     114  £5.0
11        Guaita  Crystal Palace     122  £5.0
12          Pope         Burnley     129  £4.9
13        Foster         Watford     113  £4.9
14     Fabianski        West Ham      61  £4.9
15     Caballero         Chelsea       7  £4.8
16          Ryan        Brighton     105  £4.7
17         Bravo        Man City      11  £4.7
18         Grant         Man Utd       0  £4.7
19        Romero         Man Utd       0  £4.6
20          Krul         Norwich      94  £4.6
21      Mignolet       Liverpool       0  £4.5
22      McCarthy     Southampton      74  £4.5
23      Ramsdale     Bournemouth      97  £4.5
24      Fahrmann         Norwich       1  £4.4
and so on ...]
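Note that pd.read_html returns a list of DataFrames, one per <table> it finds. A possible follow-up sketch (assuming the soup and table variables from the script above, and that the page renders one table per position) is to combine them into a single DataFrame:
# sketch: concatenate all matched tables into one DataFrame
dfs = pd.read_html(str(table))            # list of DataFrames
all_players = pd.concat(dfs, ignore_index=True)
print(all_players.head())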
Solution 2:
The table you want to scrape is generated by JavaScript, which is not executed when you do html = urlopen(url), so the table is not in the soup either.
There are many methods for getting dynamically generated data. Check here for an example.
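A minimal sketch to confirm the problem (assuming the same URL as in the question): the static HTML returned by the server contains no table element, so BeautifulSoup has nothing to find.
from urllib.request import urlopen
from bs4 import BeautifulSoup

# the server response is the page *before* JavaScript runs,
# so the player table is not part of this HTML yet
url = 'https://fantasy.premierleague.com/player-list'
soup = BeautifulSoup(urlopen(url).read(), 'html.parser')
print(soup.find('table'))  # likely None - the table is built client-side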
Solution 3:
https://fantasy.premierleague.com/player-list uses JavaScript to render its data into the HTML. BeautifulSoup cannot execute JavaScript, so we need to emulate a real browser to load the data. To do this you can use Selenium. The code below uses Firefox, but you could use Chrome, for example; check Selenium's documentation on how to get it running.
The script opens a Firefox browser, pauses for 1 second (to make sure that all JavaScript-generated data has loaded) and passes the HTML to BeautifulSoup. You might need to pip install lxml for the parser used in the script.
Then we look for all div elements with class 'Layout__Main-eg6k6r-1 cSyfD', as those contain all 4 tables on the website. You may want to use the Inspect Element tool in your browser to check the names of the tables and divs to target your search. Then you can take any of the 4 divs and search for tr elements in each.
from selenium import webdriver
import time
from bs4 import BeautifulSoup

browser = webdriver.Firefox()
browser.set_window_size(700, 900)

url = 'https://fantasy.premierleague.com/player-list'
browser.get(url)
time.sleep(1)  # give the JavaScript time to build the tables

# grab the fully rendered HTML and parse it
html = browser.execute_script('return document.documentElement.outerHTML')
all_html = BeautifulSoup(html, 'lxml')

# each of these divs wraps one of the 4 position tables
all_tables = all_html.find_all('div', {'class': 'Layout__Main-eg6k6r-1 cSyfD'})
print('Found ' + str(len(all_tables)) + ' tables')

table1_goalkeepers = all_tables[0]
rows_goalkeeper = table1_goalkeepers.tbody
print('Goalkeepers: \n')
print(rows_goalkeeper)

table2_defenders = all_tables[1]
print('Defenders \n')
rows_defenders = table2_defenders.tbody
print(rows_defenders)

browser.quit()
Sample output:
Goalkeepers:
<tbody><tr><td>Alisson</td><td>Liverpool</td><td>99</td><td>£6.2</td></tr><tr><td>Ederson</td><td>Man City</td><td>88</td><td>£6.0</td></tr><tr><td>Kepa</td><td>Chelsea</td><td>72</td><td>£5.4</td></tr><tr><td>Schmeichel</td><td>Leicester</td><td>122</td><td>£5.4</td></tr><tr><td>de Gea</td><td>Man Utd</td><td>105</td><td>£5.3</td></tr><tr><td>Lloris</td><td>Spurs</td><td>56</td><td>£5.3</td></tr><tr><td>Henderson</td><td>Sheffield Utd</td><td>135</td><td>£5.3</td></tr><tr><td>Pickford</td><td>Everton</td><td>93</td><td>£5.2</td></tr><tr><td>Patrício</td><td>Wolves</td><td>122</td><td>£5.2</td></tr><tr><td>Dubravka</td><td>Newcastle</td><td>124</td><td>£5.1</td></tr><tr><td>Leno</td><td>Arsenal</td><td>114</td><td>£5.0</td></tr><tr><td>Guaita</td><td>Crystal Palace</td><td>122</td><td>£5.0</td></tr><tr><td>Pope</td><td>Burnley</td><td>128</td><td>£4.9</td></tr><tr><td>Foster</td><td>Watford</td><td>113</td><td>£4.9</td></tr><tr><td>Fabianski</td><td>West Ham</td><td>61</td><td>£4.9</td></tr><tr><td>Caballero</td><td>Chelsea</td><td>7</td><td>£4.8</td></tr><tr><td>Ryan</td><td>Brighton</td><td>105</td><td>£4.7</td></tr><tr><td>Bravo</td><td>Man City</td><td>11</td><td>£4.7</td></tr><tr><td>Grant</td><td>Man Utd</td><td>0</td><td>£4.7</td></tr><tr><td>Romero</td><td>Man Utd</td><td>0</td><td>£4.6</td></tr><tr><td>Krul</td><td>Norwich</td><td>94</td><td>£4.6</td></tr><tr><td>Mignolet</td><td>Liverpool</td><td>0</td><td>£4.5</td></tr><tr><td>McCarthy</td><td>Southampton</td><td>74</td><td>£4.5</td></tr><tr><td>Ramsdale</td><td>Bournemouth</td><td>97</td><td>£4.5</td></tr><tr><td>Fahrmann</td><td>Norwich</td><td>1</td><td>£4.4</td></tr><tr><td>Roberto</td><td>West Ham</td><td>18</td><td>£4.4</td></tr><tr><td>Verrips</td><td>Sheffield Utd</td><td>0</td><td>£4.4</td></tr><tr><td>Kelleher</td><td>Liverpool</td><td>0</td><td>£4.4</td></tr><tr><td>Reina</td><td>Aston Villa</td><td>19</td><td>£4.4</td></tr><tr><td>Nyland</td><td>Aston Villa</td><td>11</td><td>£4.3</td></tr><tr><td>Heaton</td><td>Aston Villa</td><td>59</td><td>£4.3</td></tr><tr><td>Darlow</td><td>Newcastle</td><td>0</td><td>£4.3</td></tr><tr><td>Eastwood</td><td>Sheffield Utd</td><td>0</td><td>£4.3</td></tr><tr><td>Steer</td><td>Aston Villa</td><td>1</td><td>£4.3</td></tr><tr><td>Moore</td><td>Sheffield Utd</td><td>1</td><td>£4.3</td></tr><tr><td>Peacock-Farrell</td><td>Burnley</td><td>0</td><td>£4.3</td></tr></tbody>
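A possible follow-up sketch (assuming the rows_goalkeeper variable from the script above): pull the cell values out of the tbody into plain Python lists for further processing.
# sketch: turn each <tr> in the goalkeepers <tbody> into a list of strings
goalkeepers = [
    [td.get_text(strip=True) for td in tr.find_all('td')]
    for tr in rows_goalkeeper.find_all('tr')
]
print(goalkeepers[0])  # e.g. ['Alisson', 'Liverpool', '99', '£6.2']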
Solution 4:
This page uses JavaScript to add the data, but BeautifulSoup can't run JavaScript.
You can use Selenium to control a web browser, which can run JavaScript.
Or you can check in DevTools in Firefox/Chrome (tab: Network) which URL JavaScript uses to get the data from the server, and request that URL directly with urllib.
I chose the second method (manually searching in DevTools).
I found that JavaScript gets the data in JSON format from
https://fantasy.premierleague.com/api/bootstrap-static/
Because the data comes as JSON, I can convert it to Python lists/dictionaries with the json module and I don't need BeautifulSoup at all.
It takes more manual work to figure out the structure of the data, but it gives more data than the table on the page.
Here is all the data about the first player on the list, Alisson:
chance_of_playing_next_round = 100
chance_of_playing_this_round = 100
code = 116535
cost_change_event = 0
cost_change_event_fall = 0
cost_change_start = 2
cost_change_start_fall = -2
dreamteam_count = 1
element_type = 1
ep_next = 11.0
ep_this = 11.0
event_points = 10
first_name = Alisson
form = 10.0
id = 189
in_dreamteam = False
news =
news_added = 2020-03-06T14:00:17.901193Z
now_cost = 62
photo = 116535.jpg
points_per_game = 4.7
second_name = Ramses Becker
selected_by_percent = 9.2
special = False
squad_number = None
status = a
team = 10
team_code = 14
total_points = 99
transfers_in = 767780
transfers_in_event = 9339
transfers_out = 2033680
transfers_out_event = 2757
value_form = 1.6
value_season = 16.0
web_name = Alisson
minutes = 1823
goals_scored = 0
assists = 1
clean_sheets = 11
goals_conceded = 12
own_goals = 0
penalties_saved = 0
penalties_missed = 0
yellow_cards = 0
red_cards = 1
saves = 48
bonus = 9
bps = 439
influence = 406.2
creativity = 10.0
threat = 0.0
ict_index = 41.7
influence_rank = 135
influence_rank_type = 18
creativity_rank = 411
creativity_rank_type = 8
threat_rank = 630
threat_rank_type = 71
ict_index_rank = 294
ict_index_rank_type = 18
There is also information about teams, etc.
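For orientation, here is a minimal sketch that just lists the top-level sections of the JSON (assuming the same endpoint as in the code below):
from urllib.request import urlopen
import json

url = 'https://fantasy.premierleague.com/api/bootstrap-static/'
data = json.loads(urlopen(url).read().decode())

# top-level sections of the response,
# e.g. 'elements' (players), 'teams', 'element_types' (positions), ...
print(list(data.keys()))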
Code:
from urllib.request import urlopen
import json

#url = 'https://fantasy.premierleague.com/player-list'
url = 'https://fantasy.premierleague.com/api/bootstrap-static/'

text = urlopen(url).read().decode()
data = json.loads(text)

print('\n--- element type ---\n')

#print(data['element_types'][0])
for item in data['element_types']:
    print(item['id'], item['plural_name'])

print('\n--- Goalkeepers ---\n')

number = 0
for item in data['elements']:
    if item['element_type'] == 1:  # Goalkeepers
        number += 1
        print('---', number, '---')
        print('type :', data['element_types'][item['element_type']-1]['plural_name'])
        print('first_name :', item['first_name'])
        print('second_name :', item['second_name'])
        print('total_points:', item['total_points'])
        print('team :', data['teams'][item['team']-1]['name'])
        print('cost :', item['now_cost']/10)
        if item['first_name'] == 'Alisson':
            for key, value in item.items():
                print('   ', key, '=', value)
Result:
--- element type ---

1 Goalkeepers
2 Defenders
3 Midfielders
4 Forwards

--- Goalkeepers ---

--- 1 ---
type : Goalkeepers
first_name : Bernd
second_name : Leno
total_points: 114
team : Arsenal
cost : 5.0
--- 2 ---
type : Goalkeepers
first_name : Emiliano
second_name : Martínez
total_points: 1
team : Arsenal
cost : 4.2
--- 3 ---
type : Goalkeepers
first_name : Ørjan
second_name : Nyland
total_points: 11
team : Aston Villa
cost : 4.3
--- 4 ---
type : Goalkeepers
first_name : Tom
second_name : Heaton
total_points: 59
team : Aston Villa
cost : 4.3
The code gives the data in a different order than the table on the page, but if you put it all in a list, or better in a pandas DataFrame, you can sort it in any order you like.
EDIT:
You can use pandas to get the data from the JSON:
from urllib.request import urlopen
import json
import pandas as pd

#url = 'https://fantasy.premierleague.com/player-list'
url = 'https://fantasy.premierleague.com/api/bootstrap-static/'

# read data from url and convert to Python's list/dictionary
text = urlopen(url).read().decode()
data = json.loads(text)

# create DataFrames
players = pd.DataFrame.from_dict(data['elements'])
teams = pd.DataFrame.from_dict(data['teams'])

# divide by 10 to get `6.2` instead of `62`
players['now_cost'] = players['now_cost'] / 10

# convert team's number to its name
players['team'] = players['team'].apply(lambda x: teams.iloc[x-1]['name'])

# filter players
goalkeepers = players[ players['element_type'] == 1 ]
defenders = players[ players['element_type'] == 2 ]
# etc.

# some information
print('\n--- goalkeepers columns ---\n')
print(goalkeepers.columns)

print('\n--- goalkeepers sorted by name ---\n')
sorted_data = goalkeepers.sort_values(['first_name'])
print(sorted_data[['first_name', 'team', 'now_cost']].head())

print('\n--- goalkeepers sorted by cost ---\n')
sorted_data = goalkeepers.sort_values(['now_cost'], ascending=False)
print(sorted_data[['first_name', 'team', 'now_cost']].head())

print('\n--- teams columns ---\n')
print(teams.columns)

print('\n--- teams ---\n')
print(teams['name'].head())
# etc.
Results
--- goalkeepers columns ---
Index(['chance_of_playing_next_round', 'chance_of_playing_this_round', 'code',
'cost_change_event', 'cost_change_event_fall', 'cost_change_start',
'cost_change_start_fall', 'dreamteam_count', 'element_type', 'ep_next',
'ep_this', 'event_points', 'first_name', 'form', 'id', 'in_dreamteam',
'news', 'news_added', 'now_cost', 'photo', 'points_per_game',
'second_name', 'selected_by_percent', 'special', 'squad_number',
'status', 'team', 'team_code', 'total_points', 'transfers_in',
'transfers_in_event', 'transfers_out', 'transfers_out_event',
'value_form', 'value_season', 'web_name', 'minutes', 'goals_scored',
'assists', 'clean_sheets', 'goals_conceded', 'own_goals',
'penalties_saved', 'penalties_missed', 'yellow_cards', 'red_cards',
'saves', 'bonus', 'bps', 'influence', 'creativity', 'threat',
'ict_index', 'influence_rank', 'influence_rank_type', 'creativity_rank',
'creativity_rank_type', 'threat_rank', 'threat_rank_type',
'ict_index_rank', 'ict_index_rank_type'],
dtype='object')
--- goalkeepers sorted by name ---

    first_name         team  now_cost
94       Aaron  Bournemouth       4.5
305     Adrián    Liverpool       4.0
485       Alex  Southampton       4.5
533      Alfie        Spurs       4.0
291    Alisson    Liverpool       6.2

--- goalkeepers sorted by cost ---

    first_name       team  now_cost
291    Alisson  Liverpool       6.2
323    Ederson   Man City       6.0
263     Kasper  Leicester       5.4
169       Kepa    Chelsea       5.4
515       Hugo      Spurs       5.3

--- teams columns ---
Index(['code', 'draw', 'form', 'id', 'loss', 'name', 'played', 'points',
'position', 'short_name', 'strength', 'team_division', 'unavailable',
'win', 'strength_overall_home', 'strength_overall_away',
'strength_attack_home', 'strength_attack_away', 'strength_defence_home',
'strength_defence_away', 'pulse_id'],
dtype='object')
--- teams ---

0        Arsenal
1    Aston Villa
2    Bournemouth
3       Brighton
4        Burnley
Name: name, dtype: object
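A closing sketch (assuming the players DataFrame built in the EDIT above): to get roughly the same ordering as the player-list page, sort by position and then by cost, descending.
# sketch: order by position, then most expensive first - similar to the site's list
ordered = players.sort_values(['element_type', 'now_cost'],
                              ascending=[True, False])
print(ordered[['web_name', 'team', 'total_points', 'now_cost']].head(10))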