Python file open mode, binary or text: how can I detect whether a file is binary (non-text) in Python?

Date: 2022-12-10 19:38:32

19 answers:

Answer 0 (score: 52)

>>> textchars = bytearray({7,8,9,10,12,13,27} | set(range(0x20, 0x100)) - {0x7f})

>>> is_binary_string = lambda bytes: bool(bytes.translate(None, textchars))

Example:

>>> is_binary_string(open('/usr/bin/python', 'rb').read(1024))

True

>>> is_binary_string(open('/usr/bin/dh_python3', 'rb').read(1024))

False
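
The one-liner above works by deleting every byte that commonly appears in text and checking whether anything is left over. A minimal sketch of the same idea as named functions follows; the names `looks_binary` and `file_looks_binary` are assumptions introduced here, and the 1024-byte sample size simply mirrors the example:

```python
# Byte values treated as "text": common control characters (BEL, BS, TAB, LF,
# FF, CR, ESC) plus everything from 0x20 to 0xFF except DEL (0x7F).
TEXT_CHARS = bytes({7, 8, 9, 10, 12, 13, 27} | set(range(0x20, 0x100)) - {0x7F})


def looks_binary(data: bytes) -> bool:
    """Return True if `data` contains any byte outside TEXT_CHARS."""
    # bytes.translate(None, delete) removes every byte listed in `delete`;
    # whatever survives is a byte that rarely appears in text.
    return bool(data.translate(None, TEXT_CHARS))


def file_looks_binary(path: str, sample_size: int = 1024) -> bool:
    """Apply the heuristic to the first `sample_size` bytes of a file."""
    with open(path, "rb") as f:
        return looks_binary(f.read(sample_size))
```

Because every byte from 0x80 upward counts as text, UTF-8 and Latin-1 files pass, but UTF-16 text (which contains NUL bytes) is reported as binary.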

Answer 1 (score: 37)

You can also use the mimetypes module:

import mimetypes

...

mime = mimetypes.guess_type(file)

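Compiling a list of binary MIME types is fairly easy. For example, Apache is distributed with a mime.types file that you could parse into two lists, binary and text, and then check whether the MIME type is in your text list or your binary list.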
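
As a rough sketch of that idea, the function below classifies a path by its guessed MIME type. The helper name `guess_is_text` and the short set of text-like `application/*` types are assumptions for illustration, not a complete list. Note that `mimetypes` only looks at the file name, never at the contents.

```python
import mimetypes

# Assumed, deliberately short list of non-"text/*" types that are still text.
TEXTUAL_APPLICATION_TYPES = {"application/json", "application/xml"}


def guess_is_text(path):
    """Return True/False based on the file extension, or None if unknown."""
    mime, encoding = mimetypes.guess_type(path)
    if encoding is not None:
        return False  # compressed wrapper such as .gz: the bytes on disk are binary
    if mime is None:
        return None  # unknown extension: fall back to a content-based check
    return mime.startswith("text/") or mime in TEXTUAL_APPLICATION_TYPES
```

A `None` result is a useful signal to the caller that the extension gave no clue and the file contents need to be inspected instead.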

Answer 2 (score: 10)

Try this:

def is_binary(filename):
    """Return true if the given filename is binary.

    @raise EnvironmentError: if the file does not exist or cannot be accessed.
    @attention: found @ /topic/python/answers/21222-determine-file-type-binary-text on 6/08/
    @author: Trent Mick
    @author: Jorge Orpinel
    """
    fin = open(filename, 'rb')
    try:
        CHUNKSIZE = 1024
        while 1:
            chunk = fin.read(CHUNKSIZE)
            if b'\0' in chunk:  # found null byte (bytes literal, since the file is opened in 'rb' mode)
                return True
            if len(chunk) < CHUNKSIZE:
                break  # done
        # Note: Python does not need an "except:" clause here.
    finally:
        fin.close()
    return False

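Answer 3 (score: 9)

If you are using Python 3 with UTF-8, this is straightforward: open the file in text mode and stop processing if you get a UnicodeDecodeError. Python 3 uses Unicode strings when handling files in text mode (and bytes in binary mode), so if your encoding cannot decode arbitrary files, you will most likely get a UnicodeDecodeError.

try:
    with open(filename, "r") as f:
        for l in f:
            process_line(l)
except UnicodeDecodeError:
    pass  # Found non-text data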

Answer 4 (score: 8)

If it helps, many binary types begin with a magic number. Here is a list of file signatures.
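
A minimal sketch of a signature check, assuming a hypothetical `sniff_signature` helper and a hand-picked table of a few well-known signatures (a real table would be much longer):

```python
# A few well-known file signatures ("magic numbers"). This table is only
# illustrative and far from complete.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"%PDF-": "PDF document",
    b"PK\x03\x04": "ZIP archive",
    b"\x1f\x8b": "gzip data",
    b"\x7fELF": "ELF executable",
}


def sniff_signature(header: bytes):
    """Return a format name if `header` starts with a known signature, else None."""
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return None
```

A match only says that the file is a known binary format. The absence of a match says nothing either way, so this works best as a first pass before a byte-level heuristic.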

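Answer 5 (score: 5)

It is very simple and is based on the code from this Stack Overflow question.

You could actually write this in two lines of code, but this package saves you from having to write those two lines and test them thoroughly against all sorts of odd file types, cross-platform.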

Answer 6 (score: 5)

Here is a suggestion that uses the Unix file command:

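import re
import subprocess

def istext(path):
    # Ask file(1) to describe the file; -L follows symlinks.
    # text=True decodes the output so the str regex below can match it.
    output = subprocess.run(["file", "-L", path],
                            stdout=subprocess.PIPE, text=True).stdout
    return re.search(r':.* text', output) is not None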

Usage example:

>>> istext('/etc/motd')

True

>>> istext('/vmlinuz')

False

>>> open('/tmp/japanese', 'rb').read()

b'\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xaf\xe3\x80\x81\xe3\x81\xbf\xe3\x81\x9a\xe3\x81\x8c\xe3\x82\x81\xe5\xba\xa7\xe3\x81\xae\xe6\x99\x82\xe4\xbb\xa3\xe3\x81\xae\xe5\xb9\x95\xe9\x96\x8b\xe3\x81\x91\xe3\x80\x82\n'

>>> istext('/tmp/japanese') # works on UTF-8

True

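It has two drawbacks: it is not portable to Windows (unless you have something like the file command there), and it has to spawn an external process for each file, which may not be acceptable.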

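Answer 7 (score: 4)

Usually you have to guess.

If the file has an extension, you can treat the extension as one clue.

You can also recognize known binary formats and ignore those.

Otherwise, look at the proportion of non-printable ASCII bytes you have and take a guess from that.

You can also try decoding from UTF-8 and see whether that produces sensible output.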
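
The last two suggestions can be combined into one rough content check. The sketch below is only an illustration: the name `guess_text_by_content` and the 0.30 threshold are assumptions, and a sample cut off in the middle of a multi-byte UTF-8 sequence will fail to decode and fall back to the stricter ASCII rule.

```python
# Printable ASCII plus the usual whitespace/control characters found in text.
PRINTABLE = bytes(range(0x20, 0x7F)) + b"\t\n\r\f\b"


def guess_text_by_content(sample: bytes, threshold: float = 0.30) -> bool:
    """Guess whether `sample` is text from the share of non-text bytes."""
    if not sample:
        return True  # treat an empty sample as text
    try:
        sample.decode("utf-8")
        # Valid UTF-8: bytes >= 0x80 belong to real characters, so only
        # ASCII control bytes count against the sample.
        allowed = PRINTABLE + bytes(range(0x80, 0x100))
    except UnicodeDecodeError:
        # Not UTF-8: count every byte outside printable ASCII as suspicious.
        allowed = PRINTABLE
    leftover = sample.translate(None, allowed)
    return len(leftover) / len(sample) <= threshold
```

Decoding alone is not enough, because a run of control bytes such as `b"\x00\x01\x02"` is valid UTF-8; the ratio test is what rejects it.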

Answer 8 (score: 3)

If you are not on Windows, you can use Python Magic to determine the file type. You can then check whether it is a text/ MIME type.

Answer 9 (score: 3)

A shorter solution, with a UTF-16 warning:

def is_binary(filename):
    """
    Return true if the given filename appears to be binary.
    File is considered to be binary if it contains a NULL byte.
    FIXME: This approach incorrectly reports UTF-16 as binary.
    """
    with open(filename, 'rb') as f:
        for block in f:
            if b'\0' in block:
                return True
    return False

Answer 10 (score: 2)

We can use Python itself to check whether a file is binary, because reading a binary file that was opened in text mode will fail.

def is_binary(file_name):
    try:
        with open(file_name, 'tr') as check_file:  # try to open the file in text mode
            check_file.read()
            return False
    except UnicodeDecodeError:  # decoding failed, so the file is non-text (binary)
        return True
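
The result of this check depends on the encoding used to open the file. A minimal sketch, assuming a hypothetical `decodes_as` helper that works on bytes, makes the encoding explicit and shows why a single-byte codec such as Latin-1 is not useful here:

```python
def decodes_as(data: bytes, encoding: str = "utf-8") -> bool:
    """Return True if `data` can be decoded with `encoding`."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


png_header = b"\x89PNG\r\n\x1a\n"
print(decodes_as(png_header, "utf-8"))       # False: 0x89 is not a valid UTF-8 start byte
print(decodes_as(png_header, "latin-1"))     # True: Latin-1 maps every byte to a character
print(decodes_as(b"\x00\x01\x02", "utf-8"))  # True: NUL and control bytes are valid UTF-8
```

In other words, a successful decode is weak evidence of text, while a failed UTF-8 decode is fairly strong evidence that the data is not UTF-8 text.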

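Answer 11 (score: 1)

Here is a function that first checks whether the file starts with a BOM and, if not, looks for a zero byte within the initial 8192 bytes: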

import codecs

#: BOMs to indicate that a file is a text file even if it contains zero bytes.
_TEXT_BOMS = (
    codecs.BOM_UTF16_BE,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF32_BE,
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF8,
)

def is_binary_file(source_path):
    with open(source_path, 'rb') as source_file:
        initial_bytes = source_file.read(8192)
    return not any(initial_bytes.startswith(bom) for bom in _TEXT_BOMS) \
        and b'\0' in initial_bytes

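Technically, the check for the UTF-8 BOM is unnecessary, because UTF-8 text should not contain zero bytes for any practical purpose. But since it is a very common encoding, checking for the BOM first is faster than scanning all 8192 bytes for 0.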
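
To illustrate the effect of the BOM check, the sketch below applies the same rule to in-memory byte strings rather than files (the name `is_binary_bytes` is an assumption introduced here). Python's `utf-16` codec writes a BOM when encoding, whereas `utf-16-le` does not:

```python
import codecs

TEXT_BOMS = (
    codecs.BOM_UTF16_BE,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF32_BE,
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF8,
)


def is_binary_bytes(initial_bytes: bytes) -> bool:
    """Same rule as above: no leading BOM and at least one zero byte."""
    # bytes.startswith accepts a tuple of prefixes to try.
    return not initial_bytes.startswith(TEXT_BOMS) and b"\0" in initial_bytes


print(is_binary_bytes("hello".encode("utf-16")))     # False: starts with a BOM
print(is_binary_bytes("hello".encode("utf-16-le")))  # True: zero bytes, no BOM
print(is_binary_bytes(b"plain ASCII"))               # False: no zero bytes
```

So UTF-16 text is only recognized when it carries a BOM; BOM-less UTF-16 is still reported as binary.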

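Answer 12 (score: 1)

I think the best solution is to use the guess_type function. It holds a list of several MIME types, and you can also add your own types.

Here is the script I wrote to solve the problem: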

from mimetypes import guess_type
from mimetypes import add_type
from os import listdir

class TextFileFinder:  # placeholder name; the original shows only the methods

    def __init__(self):
        self.__addMimeTypes()

    def __addMimeTypes(self):
        add_type("text/plain", ".properties")

    def __listDir(self, path):
        try:
            return listdir(path)
        except OSError:
            print("The directory {0} could not be accessed".format(path))
            return []  # nothing to iterate over

    def getTextFiles(self, path):
        asciiFiles = []
        for files in self.__listDir(path):
            mime = guess_type(files)[0]  # None when the extension is unknown
            if mime is not None and mime.split("/")[0] == "text":
                asciiFiles.append(files)
        return asciiFiles

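It sits inside a class, as you can see from the structure of the code, but you can change pretty much anything to fit how you want to implement it in your application.

It is very simple to use.

The getTextFiles method returns a list object containing all the text files in the directory you pass in the path variable.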

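Answer 13 (score: 1)

I came here looking for exactly the same thing: a comprehensive solution provided by the standard library for detecting binary or text. After reviewing the options people suggested, the *nix file command looks like the best choice (I only develop for Linux boxen). Some others have posted solutions using file, but I think they are unnecessarily complicated, so here is what I came up with: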

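import subprocess

def test_file_isbinary(filename):
    # Pass the arguments as a list so quotes or spaces in the name are safe.
    cmd = ["file", "-b", "-e", "soft", filename]
    # check_output returns bytes, so compare against bytes prefixes.
    if subprocess.check_output(cmd)[:4] in {b'ASCI', b'UTF-'}:
        return False
    return True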

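It should go without saying, but code that calls this function should make sure the file is readable before testing it; otherwise the file will be falsely detected as binary.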

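Answer 14 (score: 1)

Try the currently maintained python-magic, which is not the same module as the one in @Kami Kisiel's answer. It does support all platforms including Windows, but you will need the libmagic binaries. This is explained in the README.

Unlike the mimetypes module, it does not use the file extension; it inspects the contents of the file instead.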

>>> import magic

>>> magic.from_file("testdata/test.pdf", mime=True)

'application/pdf'

>>> magic.from_file("testdata/test.pdf")

'PDF document, version 1.2'

>>> magic.from_buffer(open("testdata/test.pdf", "rb").read(1024))

'PDF document, version 1.2'

Answer 15 (score: 0)

If a file contains a NULL character, most programs consider it binary (that is, any file that is not "line-oriented").

Here is a Python implementation of Perl's pp_fttext() (pp_sys.c):

import sys

PY3 = sys.version_info[0] == 3

# A function that takes an integer in the 8-bit range and returns
# a single-character byte object in py3 / a single-character string
# in py2.
int2byte = (lambda x: bytes((x,))) if PY3 else chr

_text_characters = (
    b''.join(int2byte(i) for i in range(32, 127)) +
    b'\n\r\t\f\b')

def istextfile(fileobj, blocksize=512):
    """ Uses heuristics to guess whether the given file is text or binary,
    by reading a single block of bytes from the file.
    If more than 30% of the chars in the block are non-text, or there
    are NUL ('\x00') bytes in the block, assume this is a binary file.
    """
    block = fileobj.read(blocksize)
    if b'\x00' in block:
        # Files with null bytes are binary
        return False
    elif not block:
        # An empty file is considered a valid text file
        return True

    # Use translate's 'deletechars' argument to efficiently remove all
    # occurrences of _text_characters from the block
    nontext = block.translate(None, _text_characters)
    return float(len(nontext)) / len(block) <= 0.30

Also note that this code is written to run unchanged on both Python 2 and Python 3.

Answer 16 (score: 0)

A simpler way is to check whether the file contains a NULL character (\x00) by using the in operator, for example:

b'\x00' in open("foo.bar", 'rb').read()

See the complete example below:

#!/usr/bin/env python3

import argparse

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('file', nargs=1)
    args = parser.parse_args()
    with open(args.file[0], 'rb') as f:
        if b'\x00' in f.read():
            print('The file is binary!')
        else:
            print('The file is not binary!')

Sample usage:

$ ./is_binary.py /etc/hosts

The file is not binary!

$ ./is_binary.py `which which`

The file is binary!

Answer 17 (score: 0)

Are you on Unix? If so, try:

isBinary = os.system("file -b " + name + " | grep text > /dev/null")

The shell return value is inverted (0 means OK), so if "text" is found the command returns 0, which is a False expression in Python.
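
A hedged sketch of the same check without the shell pipeline, assuming a `file` executable is available on PATH (the helper name `is_text_according_to_file` is hypothetical). It compares the exit status with 0 explicitly instead of relying on the truthiness of the integer:

```python
import subprocess


def is_text_according_to_file(path: str) -> bool:
    """Return True if `file -b` describes `path` with the word "text"."""
    # Passing a list avoids the shell, so unusual file names need no quoting.
    result = subprocess.run(["file", "-b", path],
                            stdout=subprocess.PIPE, text=True)
    # Exit status 0 means success; test it explicitly rather than treating
    # the number as a boolean.
    return result.returncode == 0 and "text" in result.stdout
```

Searching the output in Python replaces the `grep text` step, so only one external process is started per file.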

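Answer 18 (score: 0)

On *NIX:

If you have access to the file shell command, shlex can help make the subprocess module more usable:

from os.path import realpath
from subprocess import check_output
from shlex import split, quote

filepath = realpath('rel/or/abs/path/to/file')
# Lowercase the output of file (not the command line) before testing it.
assert 'ascii' in check_output(split('file {}'.format(quote(filepath))), text=True).lower()

Alternatively, you can use the following in a for loop to get the output for all files in the current directory:

import os

for afile in [x for x in os.listdir('.') if os.path.isfile(x)]:
    assert 'ascii' in check_output(split('file {}'.format(quote(afile))), text=True).lower()

Or for all subdirectories:

for curdir, _subdirs, filelist in os.walk('.'):
    for afile in filelist:
        path = os.path.join(curdir, afile)
        assert 'ascii' in check_output(split('file {}'.format(quote(path))), text=True).lower()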
