How to make a custom file system importer
To learn how to import table data from a file to the Memgraph database, head over to the How to import table data guide.
If you want to read from a file system not currently supported by GQLAlchemy, or read a file type that is not currently supported, you can implement your own by extending the abstract classes FileSystemHandler and DataLoader, respectively.
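For orientation, the two abstract classes have roughly the following shape. The get_path signature matches the override shown later in this guide; the DataLoader constructor and load_data signature are assumptions based on a typical GQLAlchemy version and may differ in yours:

```python
from abc import ABC, abstractmethod
from typing import Any


class FileSystemHandler(ABC):
    """Holds a file system connection and resolves collection paths."""

    def __init__(self, fs: Any) -> None:
        self._fs = fs

    @abstractmethod
    def get_path(self, collection_name: str) -> str:
        """Return the path of a collection inside the file system."""
        pass


class DataLoader(ABC):
    """Reads table data from files found via a FileSystemHandler."""

    # Assumed constructor; check your installed version for the exact signature.
    def __init__(self, file_extension: str, file_system_handler: FileSystemHandler) -> None:
        self._file_extension = file_extension
        self._file_system_handler = file_system_handler

    @abstractmethod
    def load_data(self, collection_name: str, is_cross_table: bool = False) -> None:
        """Load rows from the given collection."""
        pass
```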
Info

You can also use this feature with Neo4j:

```python
db = Neo4j(host="localhost", port="7687", username="neo4j", password="test")
```
Info
The features below aren’t included in the default GQLAlchemy installation. To use them, make sure to install GQLAlchemy with the relevant extras.
Implementing a new FileSystemHandler
For this guide, you will use the existing PyArrowDataLoader, which is capable of reading the CSV, Parquet, ORC and IPC/Feather/Arrow file formats. The PyArrow loader class supports fsspec-compatible file systems, so to implement an Azure Blob file system, follow the steps below.
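Before diving in, it helps to see what "fsspec-compatible" buys you: PyArrow readers accept any fsspec file system through their filesystem argument, so a working adlfs connection is all the loader really needs. A minimal sketch (the account credentials, container and file names are placeholders):

```python
import adlfs
import pyarrow.parquet as pq

# adlfs provides an fsspec-compatible file system for Azure Blob Storage.
fs = adlfs.AzureBlobFileSystem(
    account_name="your_account_name", account_key="your_account_key"
)

# PyArrow readers accept any fsspec file system via the `filesystem` argument.
table = pq.read_table("my-container/data.parquet", filesystem=fs)
```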
1. Extend the FileSystemHandler class
This class holds the connection to the file system service and handles the path from which the DataLoader object reads files. To get an fsspec-compatible instance of an Azure Blob connection, you can use the adlfs package. We are going to pass adlfs-specific parameters such as account_name and account_key via kwargs. All that's left to do is to override the get_path method.
```python
import adlfs

from gqlalchemy.loaders import FileSystemHandler


class AzureBlobFileSystemHandler(FileSystemHandler):
    def __init__(self, container_name: str, **kwargs) -> None:
        """Initializes connection and data container."""
        super().__init__(fs=adlfs.AzureBlobFileSystem(**kwargs))
        self._container_name = container_name

    def get_path(self, collection_name: str) -> str:
        """Get file path in file system."""
        return f"{self._container_name}/{collection_name}"
```
2. Wrap the TableToGraphImporter
Next, you are going to wrap the TableToGraphImporter class. This step is optional, since you can use that class directly, but a custom importer class makes it easier to use. Since you will be using PyArrow for data loading, you can extend the PyArrowImporter class (which extends TableToGraphImporter) and make your own PyArrowAzureBlobImporter. This class should initialize the AzureBlobFileSystemHandler and leave the rest to the PyArrowImporter class. It should also receive a file_extension_enum argument, which defines the file type that you are going to be reading.
```python
from typing import Any, Dict, Optional

from gqlalchemy import Memgraph
from gqlalchemy.loaders import PyArrowFileTypeEnum, PyArrowImporter


class PyArrowAzureBlobImporter(PyArrowImporter):
    """PyArrowImporter wrapper for use with Azure Blob File System."""

    def __init__(
        self,
        container_name: str,
        file_extension_enum: PyArrowFileTypeEnum,
        data_configuration: Dict[str, Any],
        memgraph: Optional[Memgraph] = None,
        **kwargs,
    ) -> None:
        super().__init__(
            file_system_handler=AzureBlobFileSystemHandler(
                container_name=container_name, **kwargs
            ),
            file_extension_enum=file_extension_enum,
            data_configuration=data_configuration,
            memgraph=memgraph,
        )
```
3. Call translate()
Finally, to use your custom file system, initialize the importer class and call translate():
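The data_configuration argument expects the parsed contents of a data configuration file. A minimal sketch of producing the parsed_yaml used below, assuming a PyYAML-style file named data_configuration.yaml (the file name and the use of PyYAML are illustrative assumptions, not GQLAlchemy requirements):

```python
import yaml

# Hypothetical file name; any valid data configuration YAML works here.
with open("data_configuration.yaml") as f:
    parsed_yaml = yaml.safe_load(f)
```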
```python
importer = PyArrowAzureBlobImporter(
    container_name="test",
    file_extension_enum=PyArrowFileTypeEnum.Parquet,
    data_configuration=parsed_yaml,
    account_name="your_account_name",
    account_key="your_account_key",
)

importer.translate(drop_database_on_start=True)
```
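Note that drop_database_on_start=True clears any existing data from the database before the import begins, so pass False instead if you want to keep the current contents.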
If you want to see the full implementation of the AzureBlobFileSystemHandler and other loader components, have a look at the code. Feel free to create a PR on the GQLAlchemy repository if you think of a new feature we could use. If you have any more questions, join our community and ping us on Discord.