Inconsistencies in Parquet download files #73
Comments
The readme has a link to the schema doc, perhaps not prominently enough, where you can find the schedule doc. I just pulled the
The error is coming from the extract step: 2020 is the only year with two different schedule files, one for the original and one for the revised, and the glob is picking up both. I'll fix that in the code and the data should be correct in the next release. Thanks for the catch! If this is the only 2020-related bug in the data I'll be pleasantly surprised.
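To illustrate the kind of failure described here, the sketch below shows a glob matching two schedule files for a single year, along with one possible way to keep a single file per year. The file names, the `pick_schedule_files` helper, and the choice to prefer the revised file are all assumptions made for illustration; this is not the project's actual extract code.

```python
import re
import tempfile
from pathlib import Path


def pick_schedule_files(directory: Path) -> dict[int, Path]:
    """Return one schedule file per year, preferring a revised file if present.

    Hypothetical helper: assumes names like 2020ORIG.TXT / 2020REV.TXT.
    """
    chosen: dict[int, Path] = {}
    for path in sorted(directory.glob("*.TXT")):
        match = re.match(r"(\d{4})(\w+)\.TXT$", path.name)
        if match is None:
            continue
        year, kind = int(match.group(1)), match.group(2)
        # Keep the first file seen for a year, unless a revised one shows up.
        if year not in chosen or kind == "REV":
            chosen[year] = path
    return chosen


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ["2019SKED.TXT", "2020ORIG.TXT", "2020REV.TXT"]:
        (root / name).touch()

    # A plain per-year glob matches both 2020 files.
    naive = sorted(p.name for p in root.glob("2020*.TXT"))
    # The helper keeps exactly one file per year.
    picked = {year: p.name for year, p in pick_schedule_files(root).items()}

print(naive)
print(picked)
```

Whether the original or the revised schedule is the right one to keep depends on the intended use of the table; the sketch only shows that the choice has to be made explicitly.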
You're right, I amend my statement to: 'postponement_indicator' is all-null in 2021 and 2022 only. When grouping the entire schedules table by year, I see zero postponements for 2021 and 2022, while every other year has at least 20 or so.
Yep, that's definitely a bug - this time from gaps in the data that have since been corrected by Retrosheet. Those years will be filled on the next release (which I'm guessing I'll have up here in December).
Great, thanks for your diligence and quick responses!
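The per-year check described here can be sketched with pandas. This is a minimal sketch on made-up rows, assuming the schedule table has a `date` column and that a null `postponement_indicator` means no postponement was recorded; only the `postponement_indicator` name comes from this thread.

```python
import pandas as pd

# Toy stand-in for schedule.parquet; real data would be loaded with
# pd.read_parquet(...). The `date` column name is an assumption.
schedule = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2019-04-01", "2019-04-02", "2021-04-01", "2021-04-02", "2022-04-08"]
        ),
        "postponement_indicator": ["Rain", None, None, None, None],
    }
)

# count() skips nulls, so this is the number of recorded postponements per year.
postponements_by_year = (
    schedule.groupby(schedule["date"].dt.year)["postponement_indicator"].count()
)

# Years in which the column is entirely null.
all_null_years = sorted(
    postponements_by_year[postponements_by_year == 0].index.tolist()
)
print(all_null_years)
```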
Describe the bug
I downloaded the parquet files directly from the OneDrive link included in the repo's readme, and have been reading them with pyarrow and pandas. In digging through the data (2022 only, so far), I discovered two problems, one of which led me to the other.
First (and maybe more of a feature request than a bug) is the fact that, best I can tell, there is no easy way to tell whether games are regular season or postseason/allstar/other from any of the parquet files (I would particularly expect game.parquet, gamelog.parquet, or schedule.parquet to have an indicator column for this, but I do not see one).
Second appears to be more of a bug. Contrary to no. 1 above, schedule.parquet only seems to include regular season games. Great, we can simply filter game.parquet and other files by whether or not the game exists in schedule.parquet, right? Nope, schedule.parquet seems to list games that never actually occurred. As an example: schedule.parquet includes a game MIL @ CHN 2022-04-08, not part of a double header. However, game.parquet (as well as baseball-reference and other sources) tells us that no such game exists! MIL @ CHN games occurred on 4/7/22 and 4/9/22, but not 4/8/22. I find 88 of these 'phantom games' in schedule.parquet for 2022. And it's not a byproduct of the pandemic or lockout: I found these inconsistencies in every year I've looked at as far back as 2000.
To Reproduce
Steps to reproduce the behavior:
schedule.parquet and game.parquet as described above.
Expected behavior
Games will be consistent across files, and a column easily delineates whether games are regular season or not.
I am entirely open to the fact that the files are exactly as intended and I am just missing something that explains the discrepancies, let me know if that is the case. Thanks!
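The comparison described above amounts to an anti-join of the schedule against the games that were played. Here is a minimal sketch on toy rows built from the MIL @ CHN example, assuming both tables can be keyed on `date`, `visiting_team`, and `home_team` (hypothetical column names; the actual parquet schemas may differ) and ignoring doubleheaders:

```python
import pandas as pd

# Assumed key columns; the real files may name these differently.
KEYS = ["date", "visiting_team", "home_team"]

# Toy stand-ins for schedule.parquet and game.parquet.
schedule = pd.DataFrame(
    {
        "date": ["2022-04-07", "2022-04-08", "2022-04-09"],
        "visiting_team": ["MIL", "MIL", "MIL"],
        "home_team": ["CHN", "CHN", "CHN"],
    }
)
game = pd.DataFrame(
    {
        "date": ["2022-04-07", "2022-04-09"],
        "visiting_team": ["MIL", "MIL"],
        "home_team": ["CHN", "CHN"],
    }
)

# Left-join with an indicator column, then keep schedule rows with no match.
merged = schedule.merge(
    game[KEYS].drop_duplicates(), on=KEYS, how="left", indicator=True
)
phantom = merged.loc[merged["_merge"] == "left_only", KEYS].reset_index(drop=True)
print(phantom)
```

With `indicator=True`, pandas adds a `_merge` column whose value is `left_only` for scheduled rows that have no counterpart among the played games.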