Usage Instructions

`Method 1: (requires Python or Java knowledge)`

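Since there was no convenient tool for extracting financial data from financial-report PDFs, I looked through the available material and wrote small extraction tools in two languages, Python and Java.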

Note: to convert a PDF of more than 10 pages to Excel, modify the code to call the methods provided by spire.pdf-3.8.5.jar inside a for loop (the free API is limited to 10 pages).
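The loop mentioned in the note can be sketched as follows. This is only an illustration of how to split a long PDF into ranges of at most 10 pages; the helper name `page_batches` is hypothetical, and the actual conversion call from spire.pdf-3.8.5.jar (or from your own Python code) is deliberately not shown, because it depends on the code you are modifying.

```python
def page_batches(total_pages, batch_size=10):
    """Yield inclusive, 1-based (first, last) page ranges of at most batch_size pages."""
    if total_pages < 0 or batch_size < 1:
        raise ValueError("total_pages must be >= 0 and batch_size must be >= 1")
    for first in range(1, total_pages + 1, batch_size):
        # The last batch may be shorter than batch_size.
        yield first, min(first + batch_size - 1, total_pages)


# Example: a 23-page report is processed in three passes.
for first, last in page_batches(23):
    # Replace this print with the real conversion call for pages first..last.
    print(f"convert pages {first}-{last}")
```

Each pass would then write its own Excel output, which you can merge afterwards if a single workbook is needed.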

========================================================================

Update 2021-11-19

`Method 2: (personally recommended, no coding required)`

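I found an open-source tool that is very easy to use and better suited to beginners: it extracts tables from a PDF and converts them to Excel, and only needs to be downloaded and installed. In a quick test I ran with it, the tables extracted from the PDF and converted to an Excel file were 100% accurate.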

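Requirements: first install a Java environment, then download the Windows package tabula-win.zip, unzip it, and double-click tabula.exe.
Note: there are plenty of tutorials online for installing Java. If you get stuck, here is one reference tutorial: installing Java 8 on Windows 10.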

Windows
1. Windows & Linux users will need a copy of Java installed. You can download Java here. (Java is included in the Mac version.)
2. Download tabula-win.zip from https://tabula.technology/. Unzip the whole thing and open the tabula.exe file inside. A browser should automatically open to http://127.0.0.1:8080/ . If not, open your web browser of choice and visit that link.
To close Tabula, just go back to the console window and press "Control-C" (as if to copy).
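If the browser does not open by itself, it can help to check whether the local Tabula server is actually listening before visiting the address manually. The following is a minimal sketch using only the Python standard library; the function name `tabula_is_listening` is hypothetical, and the host and port are simply the ones from the address quoted above.

```python
import socket


def tabula_is_listening(host="127.0.0.1", port=8080, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    if tabula_is_listening():
        print("Server is up: open http://127.0.0.1:8080/ in your browser")
    else:
        print("Nothing is listening on port 8080; is tabula.exe still running?")
```

Note that this only confirms that some program is listening on the port; it does not verify that the program is Tabula.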

========================================================================

Update 2022-03-24

`Method 3: (requires Python knowledge)`

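For complex tables, the Tabula tool can still produce partially garbled formatting when extracting tables. For those cases I found tabula-py, a Python library that wraps the tabula-java toolkit.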

GitHub: https://github.com/chezou/tabula-py

Install the library in your Python environment: `pip install tabula-py`

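Use the API provided by tabula-py to read the PDF and extract the table data, then clean the data according to your own needs. The development environment requirements are: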

Java 8+
Python 3.7+
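What "cleaning" means depends on the report, but one common step with financial statements is turning text cells into numbers. The following is a minimal sketch under assumed conventions (thousands separators as commas, negatives written in parentheses, dashes for empty cells); the helper name `parse_amount` is hypothetical and not part of tabula-py.

```python
# Cell contents treated as "no value" (assumed conventions, adjust as needed).
_BLANKS = {"", "-", "--", "nan"}


def parse_amount(text):
    """Convert a cell such as '1,234.50' or '(1,234.50)' to float.

    Returns None for blank cells and for text that is not a number.
    """
    s = str(text).strip()
    if s.lower() in _BLANKS:
        return None
    # Accounting style: a value in parentheses is negative.
    negative = s.startswith("(") and s.endswith(")")
    if negative:
        s = s[1:-1]
    s = s.replace(",", "").replace(" ", "")
    try:
        value = float(s)
    except ValueError:
        return None
    return -value if negative else value


print(parse_amount("(1,234.50)"))
```

With a DataFrame returned by tabula-py, this could be applied to one column with `df["amount"].map(parse_amount)`, where `amount` is a placeholder column name.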

Example

tabula-py enables you to extract tables from a PDF into a DataFrame, or a JSON. It can also extract tables from a PDF and save the file as a CSV, a TSV, or a JSON.

import tabula

# Read pdf into list of DataFrame
dfs = tabula.read_pdf("test.pdf", pages='all')

# Read remote pdf into list of DataFrame
dfs2 = tabula.read_pdf("https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/arabic.pdf")

# convert PDF into CSV file
tabula.convert_into("test.pdf", "output.csv", output_format="csv", pages='all')

# convert all PDFs in a directory
tabula.convert_into_by_batch("input_directory", output_format='csv', pages='all')

See the example notebook for more details. I also recommend reading the tutorial article.
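Since the goal of this project is an Excel file rather than a CSV, the following is a minimal sketch of writing every extracted table to its own worksheet. It assumes that pandas and an Excel writer engine such as openpyxl are installed in addition to tabula-py; the function names `sheet_name` and `pdf_tables_to_excel` and the file names are hypothetical.

```python
def sheet_name(index, prefix="table"):
    """Build a worksheet name; Excel limits sheet names to 31 characters."""
    return f"{prefix}_{index}"[:31]


def pdf_tables_to_excel(pdf_path, xlsx_path, pages="all"):
    """Extract all tables from pdf_path and write one worksheet per table.

    Returns the number of tables written.
    """
    # Imported here so the helper above can be used without these packages.
    import pandas as pd
    import tabula

    dfs = tabula.read_pdf(pdf_path, pages=pages)  # list of DataFrames
    if not dfs:
        # A workbook needs at least one sheet, so do not create an empty file.
        return 0
    with pd.ExcelWriter(xlsx_path) as writer:
        for i, df in enumerate(dfs, start=1):
            df.to_excel(writer, sheet_name=sheet_name(i), index=False)
    return len(dfs)


if __name__ == "__main__":
    count = pdf_tables_to_excel("test.pdf", "output.xlsx")
    print(f"wrote {count} table(s)")
```

Keeping one table per sheet makes it easier to see which tables came out garbled and need manual cleaning.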
