Mantisus
Collaborator

@Mantisus Mantisus commented Mar 3, 2025

Description

  • Handle the error raised when an empty metadata file has been created

Issues

@Mantisus Mantisus requested a review from vdusek March 3, 2025 13:45
@Mantisus Mantisus self-assigned this Mar 3, 2025
@Mantisus Mantisus requested a review from Pijukatel March 3, 2025 17:15
Collaborator

@Pijukatel Pijukatel left a comment

Could you please add a test that captures this scenario?

# Write the metadata to the file
file_path = os.path.join(entity_directory, METADATA_FILENAME)
f = await asyncio.to_thread(open, file_path, mode='wb')
mode = 'r+b' if os.path.exists(file_path) else 'wb'
Collaborator

@Pijukatel Pijukatel Mar 4, 2025

If the file exists, we can still open it in "wb" mode. If we open it in "r+b" mode, we do not replace the whole file with the new content; we only overwrite bytes starting from the beginning of the file, and anything beyond the end of the new content stays in place. I doubt that is what we want.

Imagine a file with content b"abc" that you want to change:

with open(path, 'r+b') as f:
    f.write(b"x") 

-> b"xbc"

with open(path, 'wb') as f:
    f.write(b"x") 

-> b"x"

Maybe I have missed the point here, but so far it does not seem right to me.
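
The comparison above can be checked with a small self-contained script. This is only a sketch: the file name `demo.bin` and the helper `demo` are made up for illustration and are not part of the PR.

```python
import os
import tempfile


def demo(mode: str) -> bytes:
    """Create a file containing b'abc', write b'x' using `mode`, return the result."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'demo.bin')  # hypothetical file name
        with open(path, 'wb') as f:
            f.write(b'abc')
        # Re-open the existing file with the mode under test and write one byte.
        with open(path, mode) as f:
            f.write(b'x')
        with open(path, 'rb') as f:
            return f.read()


r_plus_result = demo('r+b')  # overwrites in place, keeps the old tail
wb_result = demo('wb')       # truncates first, then writes
print(r_plus_result, wb_result)
```
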

Collaborator Author

Yes, I suppose it is because of wb + the fact that we work with files asynchronously that we have a situation where empty files are created.

We open the file in wb mode, which deletes its contents immediately, and then the asynchronous context switches before the new content is written. If the crawler is interrupted at this point, the file remains empty.

I settled on r+b mode because we always write formatted json files. These are also metadata files, so the fields in them are not changed. So I think it should work, since we will be overwriting the same number of lines each time.
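
Both failure modes discussed in this thread can be reproduced in isolation. The sketch below is not the project's code: the file name and the `name` / `item_count` fields are hypothetical, and the interruption is simulated by opening the file and writing nothing. It shows that `wb` leaves an empty file that fails to parse, that `r+b` leaves stale bytes when the new JSON is shorter than the old one, and that calling `truncate()` after the write removes the stale tail (though a partially written file is still possible if the process dies mid-write).

```python
import json
import os
import tempfile


def try_parse(path: str):
    """Return the parsed JSON, or the JSONDecodeError raised while parsing."""
    with open(path, encoding='utf-8') as f:
        text = f.read()
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        return exc


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'metadata.json')  # hypothetical file name
    # Hypothetical metadata payloads; the new one is two bytes shorter.
    old = json.dumps({'name': 'default', 'item_count': 100}).encode()
    new = json.dumps({'name': 'default', 'item_count': 7}).encode()

    # Case 1: 'wb' truncates at open() time, before anything is written.
    with open(path, 'wb') as f:
        f.write(old)
    with open(path, 'wb'):
        pass  # simulated interruption: opened, never written
    empty_size = os.path.getsize(path)
    empty_result = try_parse(path)

    # Case 2: 'r+b' overwrites in place and keeps the tail of the old content.
    with open(path, 'wb') as f:
        f.write(old)
    with open(path, 'r+b') as f:
        f.write(new)
    with open(path, 'rb') as f:
        stale_content = f.read()
    stale_result = try_parse(path)

    # Case 3: 'r+b' followed by truncate() cuts the file at the write position.
    with open(path, 'wb') as f:
        f.write(old)
    with open(path, 'r+b') as f:
        f.write(new)
        f.truncate()
    fixed_result = try_parse(path)

print(empty_size, repr(empty_result))
print(stale_content, repr(stale_result))
print(fixed_result)
```
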

Collaborator

Thanks for the explanation. I think I am getting it now. But it would be great to have this covered by a well-commented test, as it is by no means self-explanatory.

Collaborator Author

After testing and thinking about it, I decided to go back to wb. It's quite a rare bug and it doesn't have a critical impact. But if we encounter some artifacts due to r+b, it can be much more painful to find the cause.

I added a test to reproduce the case of creating an empty file. Maybe we can use it if we try to find a more stable solution for overwriting files.
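
One commonly used approach to more robust overwriting, which this PR does not implement, is to write the new content to a temporary file in the same directory and then swap it into place with `os.replace`. The sketch below assumes that approach; `write_json_atomically` is a hypothetical helper name, and in the asynchronous code it would presumably be run through `asyncio.to_thread` like the existing `open` call.

```python
import json
import os
import tempfile


def write_json_atomically(path: str, payload: dict) -> None:
    """Write payload to a temp file next to `path`, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'w', encoding='utf-8') as f:
            json.dump(payload, f, indent=2)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before the swap
        # The target keeps its old content until this call succeeds.
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the temp file if anything failed before the swap.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'metadata.json')  # hypothetical file name
    write_json_atomically(target, {'item_count': 100})
    write_json_atomically(target, {'item_count': 7})
    with open(target, encoding='utf-8') as f:
        final = json.load(f)
    leftovers = [name for name in os.listdir(tmp) if name.endswith('.tmp')]

print(final, leftovers)
```

With this pattern an interruption before the swap leaves the previous metadata file untouched rather than empty, at the cost of one extra file and a rename per write.
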

Collaborator

@vdusek vdusek left a comment

Thanks, LGTM.

@Mantisus Mantisus requested a review from Pijukatel March 7, 2025 12:41
@Pijukatel Pijukatel merged commit b00876e into apify:master Mar 18, 2025
23 checks passed
@vdusek vdusek added the t-tooling Issues with this label are in the ownership of the tooling team. label Mar 18, 2025
Development

Successfully merging this pull request may close these issues.

error starting up crawlee - json.decoder.JSONDecodeError: Expecting value
3 participants