.. currentmodule:: socceraction.data.opta

=========================
Loading Opta data
=========================
`Opta's event stream data`_ comes in many different flavours. The
:class:`OptaLoader` class provides an API client enabling you to fetch
data from the following data feeds as Pandas DataFrames:

- Opta F1, F9 and F24 JSON feeds
- Opta F7 and F24 XML feeds
- StatsPerform MA1 and MA3 JSON feeds
- WhoScored.com JSON data

Currently, only loading data from local files is supported.

--------------------------
Connecting to a data store
--------------------------
First, you have to create an :class:`OptaLoader` object and configure it
for the data feeds you want to use.

Generic setup
=============
To set up an :class:`OptaLoader` you have to specify the root
directory, the filename hierarchy of the feeds and a parser for each feed.
For example::

    from socceraction.data.opta import OptaLoader, parsers

    api = OptaLoader(
        root="data/opta",
        feeds={
            "f7": "f7-{competition_id}-{season_id}-{game_id}.xml",
            "f24": "f24-{competition_id}-{season_id}-{game_id}.xml",
        },
        parser={
            "f7": parsers.F7XMLParser,
            "f24": parsers.F24XMLParser,
        },
    )

The loader relies on the directory structure and file names to determine
which files to parse, so the files under the root directory must follow the
hierarchy declared in the ``feeds`` argument. A wide range of file names
and directory structures is supported. However, the competition, season, and
game identifiers must be part of the file path so that the loader can locate
the corresponding files for each entity. For example, you might have grouped
feeds by competition and season as follows::

    root
    ├── competition_<competition_id>
    │   ├── season_<season_id>
    │   │   ├── f7_<game_id>.xml
    │   │   └── f24_<game_id>.xml
    │   └── ...
    └── ...

In this case, you can use the following feeds configuration::

    feeds = {
        "f7": "competition_{competition_id}/season_{season_id}/f7_{game_id}.xml",
        "f24": "competition_{competition_id}/season_{season_id}/f24_{game_id}.xml",
    }

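
As a rough illustration of how such a template can be mapped to files on
disk, the sketch below fills in the placeholders with ``str.format`` and swaps
them for ``*`` wildcards to list the matching files. The helper names
``feed_path`` and ``find_feed_files`` and the identifiers used are
hypothetical, and this is not necessarily how :class:`OptaLoader` resolves
files internally.

.. code-block:: python

    import glob
    import os
    import tempfile

    # Feed template copied from the configuration above.
    F24_TEMPLATE = "competition_{competition_id}/season_{season_id}/f24_{game_id}.xml"


    def feed_path(root, template, **ids):
        """Fill the placeholders of a feed template and join it to the root."""
        return os.path.join(root, template.format(**ids))


    def find_feed_files(root, template):
        """List every file under root that matches the template, whatever its IDs."""
        pattern = template.format(competition_id="*", season_id="*", game_id="*")
        return sorted(glob.glob(os.path.join(root, pattern)))


    # Build a throwaway directory tree with two empty F24 files (made-up IDs).
    root = tempfile.mkdtemp()
    for game_id in (1001, 1002):
        path = feed_path(root, F24_TEMPLATE, competition_id=8, season_id=2021, game_id=game_id)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()

    matches = find_feed_files(root, F24_TEMPLATE)
    print([os.path.relpath(m, root) for m in matches])

Running this lists the two generated files relative to the temporary root,
which is a quick way to check that a template matches your directory layout
before handing it to the loader.
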
.. note::

    On Windows, the backslash character should be used as a path separator.

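
One way to avoid hard-coding a separator is to build the templates with
``os.path.join``, which uses the separator of the platform the code runs on.
This is only a sketch of that idea; the directory names are taken from the
example layout above.

.. code-block:: python

    import os

    # os.path.join inserts the separator of the current platform:
    # "/" on Linux and macOS, "\" on Windows.
    feeds = {
        "f7": os.path.join(
            "competition_{competition_id}", "season_{season_id}", "f7_{game_id}.xml"
        ),
        "f24": os.path.join(
            "competition_{competition_id}", "season_{season_id}", "f24_{game_id}.xml"
        ),
    }
    print(feeds["f7"])

The resulting dictionary can be passed as the ``feeds`` argument in the same
way as the hand-written templates.
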
Furthermore, a few standard configurations are provided. These are listed below.

Opta F7 and F24 XML feeds
=========================
.. code-block:: python

    from socceraction.data.opta import OptaLoader

    api = OptaLoader(root="data/opta", parser="xml")

The root directory should have the following structure:

.. code-block::

    root
    ├── f7-{competition_id}-{season_id}-{game_id}.xml
    ├── f24-{competition_id}-{season_id}-{game_id}.xml
    └── ...

Opta F1, F9 and F24 JSON feeds
==============================
.. code-block:: python

    from socceraction.data.opta import OptaLoader

    api = OptaLoader(root="data/opta", parser="json")

The root directory should have the following structure:

.. code-block::

    root
    ├── f1-{competition_id}-{season_id}.json
    ├── f9-{competition_id}-{season_id}.json
    ├── f24-{competition_id}-{season_id}-{game_id}.json
    └── ...

StatsPerform MA1 and MA3 JSON feeds
===================================
.. code-block:: python

    from socceraction.data.opta import OptaLoader

    api = OptaLoader(root="data/statsperform", parser="statsperform")

The root directory should have the following structure:

.. code-block::

    root
    ├── ma1-{competition_id}-{season_id}.json
    ├── ma3-{competition_id}-{season_id}-{game_id}.json
    └── ...

WhoScored
=========
`WhoScored.com`_ is a popular website that provides detailed live match statistics.
These statistics are compiled from Opta's event feed, which can be scraped
from the website's source code using a library such as `soccerdata`_. Once you
have downloaded the raw JSON data, you can parse it using the :class:`OptaLoader`
with:

.. code-block:: python

    from socceraction.data.opta import OptaLoader

    api = OptaLoader(root="data/whoscored", parser="whoscored")

The root directory should have the following structure:

.. code-block::

    root
    ├── {competition_id}-{season_id}-{game_id}.json
    └── ...

Alternatively, the soccerdata library provides a wrapper that immediately
returns an :class:`OptaLoader` object for a scraped dataset.

.. code-block:: python

    import soccerdata as sd

    # Set up a scraper for the 2021/2022 Premier League season
    ws = sd.WhoScored(leagues="ENG-Premier League", seasons=2021)
    # Scrape all games and return an OptaLoader object
    api = ws.read_events(output_fmt='loader')

.. warning::

    Scraping data from WhoScored.com violates the site's terms of service,
    so the legality of scraping this data is a grey area. If you decide to
    use this data anyway, you do so on your own responsibility.

------------
Loading data
------------
Next, you can load the match event stream data and metadata by calling the
corresponding methods on the :class:`OptaLoader` object.

- :func:`OptaLoader.competitions()`
- :func:`OptaLoader.games()`
- :func:`OptaLoader.teams()`
- :func:`OptaLoader.players()`
- :func:`OptaLoader.events()`

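
The methods above can be combined as in the following minimal sketch. It
assumes that ``games()`` takes a competition and season identifier and that
``teams()``, ``players()`` and ``events()`` take a game identifier. The
helper ``load_game`` is a hypothetical name and not part of the library.

.. code-block:: python

    def load_game(api, competition_id, season_id, game_id):
        """Collect the metadata and event stream for a single game.

        ``api`` is assumed to be a configured OptaLoader, or any object
        exposing the same five methods.
        """
        return {
            "competitions": api.competitions(),
            "games": api.games(competition_id, season_id),
            "teams": api.teams(game_id),
            "players": api.players(game_id),
            "events": api.events(game_id),
        }

    # Hypothetical usage with made-up identifiers:
    # data = load_game(api, competition_id=8, season_id=2021, game_id=1001)
    # data["events"].head()

Each entry of the returned dictionary holds whatever the corresponding loader
method returns, so the tables can be inspected or joined afterwards.
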
.. _Opta's event stream data: https://www.statsperform.com/opta-event-definitions/
.. _soccerdata: https://soccerdata.readthedocs.io/en/latest/datasources/WhoScored.html
.. _WhoScored.com: https://www.whoscored.com/