Open Intelligence will be an open source machine learning middleware API.
It will be an alternative to the closed source machine learning APIs of Google, Amazon, Microsoft, etc. that anyone can set up on premises.
Open Intelligence could be used by a file sharing system such as Nextcloud to analyse and transform document data (OCR for example), by Jellyfin for video recommendations and more.
The first feature, table transformation from images, is a proof of concept for my bachelor thesis. Since I still have to graduate this year, the current focus is the proof of concept and not Open Intelligence. This is why the server is just a basic debugging Flask REST API.
But if you have suggestions and/or questions, you can of course open an issue. Contributions are also welcome!
Wouldn't it be great if you could take a screenshot of a table, on a website for example, and automatically paste the table data from that screenshot into your spreadsheet software?
This is what table detection and transformation is about.
In this proof of concept, the image is transformed in two main steps:
- Table detection
- Table structure analysis and transformation
It is possible that the image not only contains the table we need to transform, but also a page title, paragraphs, drawings, etc.
This is why we need to first extract the table from the image with table detection.
For table detection, I'm currently using CascadeTabNet. With their algorithm, the authors of CascadeTabNet achieved 3rd rank in the ICDAR 2019 post-competition results for table detection while attaining the best accuracy results for the ICDAR 2013 and TableBank datasets. Their paper can be found here.
Although CascadeTabNet provides an approach for table transformation, the results were not really convincing in my case. This is why I wrote my own algorithm for it.
For a bordered table (a table in which every cell has borders), we assign every text box a row and column index based on its position relative to the vertical and horizontal border lines. For example: if the y2 value of a text box is greater than the y value of the first horizontal line but lower than the y value of the second horizontal line, then we can assume that this text box belongs to the first row of the table.
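A minimal sketch of this idea, not the code from this repository. It assumes the border line positions are already known as plain lists of coordinates and that each text box is a dict with hypothetical `x2` and `y2` keys. Using `x2` for the column index is an assumption made by analogy with the `y2` example above.

```python
from bisect import bisect_right


def assign_index(position, line_positions):
    """Return the 0-based index of the band between two consecutive
    border lines that contains `position`, or None if the position
    lies outside the outermost lines."""
    lines = sorted(line_positions)
    i = bisect_right(lines, position)  # number of lines at or before position
    if i == 0 or i == len(lines):
        return None
    return i - 1


def index_text_boxes(text_boxes, horizontal_lines, vertical_lines):
    """Attach a row and column index to every text box.

    Assumption: rows are derived from y2 (as in the example in the text)
    and columns from x2.
    """
    indexed = []
    for box in text_boxes:
        indexed.append({
            **box,
            "row": assign_index(box["y2"], horizontal_lines),
            "col": assign_index(box["x2"], vertical_lines),
        })
    return indexed


if __name__ == "__main__":
    boxes = [
        {"text": "A", "x2": 40, "y2": 30},
        {"text": "B", "x2": 140, "y2": 80},
    ]
    for box in index_text_boxes(boxes, [0, 50, 100], [0, 100, 200]):
        print(box["text"], box["row"], box["col"])
```

A binary search over the sorted line positions keeps the lookup cheap even for tables with many rows and columns.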
A borderless table has no border lines to compare against, but the spacing between text boxes of different columns is significantly greater than the spacing between text boxes of the same column. Similarly, text boxes of different rows have significantly different y-values. Based on these observations, we use hierarchical clustering to cluster the text boxes, once for the column index and once for the row index. I'm using hierarchical clustering because it lets us specify a distance threshold for merging clusters. Other popular clustering algorithms, such as K-means clustering, require the number of clusters up front. For that we would need to know the number of columns and rows, which is information we don't have.
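To illustrate the clustering step, here is a small dependency-free sketch; it is not the code from this repository. For one-dimensional values, single-linkage hierarchical clustering with a distance cut-off is equivalent to sorting the values and starting a new cluster wherever the gap between neighbours exceeds the cut-off, which is what `cluster_1d` (a hypothetical helper) does. Which coordinate to cluster on, and the threshold values, are assumptions. With SciPy, `scipy.cluster.hierarchy.fcluster` with `criterion="distance"` offers the same kind of cut-off.

```python
def cluster_1d(values, max_distance):
    """Single-linkage clustering of 1-D values with a distance cut-off.

    Returns one cluster label per input value. Labels are numbered in
    ascending order of value, so they can serve directly as row or
    column indices.
    """
    # Indices of the values, ordered from smallest to largest value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    current = 0
    for prev, cur in zip(order, order[1:]):
        # A gap larger than the cut-off separates two clusters.
        if values[cur] - values[prev] > max_distance:
            current += 1
        labels[cur] = current
    return labels


if __name__ == "__main__":
    # Assumed inputs: left x-coordinates and top y-coordinates of text boxes.
    x_left = [10, 12, 110, 108, 205]
    y_top = [5, 7, 6, 40, 41]
    print(cluster_1d(x_left, max_distance=20))  # column indices
    print(cluster_1d(y_top, max_distance=10))   # row indices
```

The number of columns and rows falls out of the data: it is simply the number of distinct labels.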
- The /detection/table REST API method is called from the front end, with the image as a file attachment
- The file is validated by the Validator class. This prevents invalid or non-image files from being fed to the rest of the pipeline
- After this, the domain controller is called for further handling:
- The image is preprocessed: extra padding is added to the image. This is necessary because, while the table detector works well on documents that contain one or more tables, it has difficulty recognising the table in a cropped image of a single table.
- The tables are detected by CascadeTabNet and an isolated image of every table is sent to the table processor
- The table processor finds all the text boxes with OCR (Tesseract), including the text value, position, width, height, etc. of every text box in the image. All this information about the text boxes is held in a Pandas DataFrame
- The table image (and the text boxes in the case of borderless tables) is analysed to assign a row and column index to every text box
- Text boxes with identical row and column values are aggregated
- The text box dataframe is transformed into a new dataframe containing only the text information, since other values such as the x, y, width and height of the text boxes are no longer needed.
- This new dataframe is transformed into JSON data
- The table processor steps above (from OCR through the JSON transformation) are repeated for the image of every detected table.
- The final result from the domain controller is sent back to the REST API server, which returns the analysis result to the client that made the API request
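The last table processor steps (aggregating text boxes that share a cell and converting the result to JSON) can be sketched as follows. This is an illustration using plain dicts instead of a Pandas DataFrame, not the code from this repository. The `row`, `col`, `x` and `text` keys and the output shape (a JSON list of rows) are assumptions.

```python
import json
from collections import defaultdict


def table_to_json(text_boxes):
    """Merge text boxes that share a (row, col) cell and return the
    table as a JSON list of rows. Empty cells become empty strings."""
    cells = defaultdict(list)
    # Sort by x within a cell so that words are joined in reading order.
    for box in sorted(text_boxes, key=lambda b: (b["row"], b["col"], b["x"])):
        cells[(box["row"], box["col"])].append(box["text"])
    if not cells:
        return json.dumps([])
    n_rows = max(row for row, _ in cells) + 1
    n_cols = max(col for _, col in cells) + 1
    grid = [
        [" ".join(cells.get((row, col), [])) for col in range(n_cols)]
        for row in range(n_rows)
    ]
    return json.dumps(grid)


if __name__ == "__main__":
    boxes = [
        {"text": "World", "row": 0, "col": 0, "x": 60},
        {"text": "Hello", "row": 0, "col": 0, "x": 10},
        {"text": "42", "row": 1, "col": 1, "x": 120},
    ]
    print(table_to_json(boxes))
```

Once every text box has a row and column index, the position values only matter for ordering the words inside a cell, which is why they can be dropped afterwards.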
- NVIDIA GPU
- CUDA 10.0 or higher (for table detection)
- Linux or macOS (please don't try it on Windows, it won't work; save yourself a few hours of pointless pain)
If you don't need the table detection, then you won't need a GPU and you will be able to use the software on Windows too (see Note).
- Install MMDetection, based on the instructions provided in the README.md in CascadeTabNet. CUDA 10.0 is mentioned in the installation section, but CUDA 11.0 works just as well. Note: the machine learning model that CascadeTabNet uses was created with the old 1.2 version of MMDetection. I would appreciate it if someone could recreate the model with the current version 2 of MMDetection, which does not support models created with version 1.x. The nice thing about version 2 of MMDetection is that it supports inference without a GPU!
- Clone the repository:
git clone https://github.com/nazarimilad/open-intelligence-backend
- Go to the cloned repository:
cd open-intelligence-backend
- Create a new Python virtual environment:
python3 -m venv env
- Activate the new virtual environment:
source env/bin/activate
- Install the requirements:
pip3 install -r requirements.txt
- Start the server with
python3 main.py
- ???
- Profit
- When you don't need the software anymore, you can deactivate the environment:
deactivate
If you already have the image of an isolated table, then you can use the structure analysis and transformation code directly:
python3 table_transformer.py --table-type borderless -i datasets/all_tables/2.png
- The code for the front end, as seen in the demo gif, can be found here
- The server listens on port 5000 by default. You can change the host and port in main.py
- You will notice that the structure analysis code of CascadeTabNet is still there, although it's not used. I'm keeping it for comparison purposes and will clean up the code later.
- I'm a complete beginner in the computer vision domain. I'm certain that my code for horizontal and vertical line detection is crap, so I would really appreciate it if someone with good knowledge of computer vision could give some feedback on my work.